# Python (primarily pandas) review

### The pandas library and DataFrames

The first thing you'll usually do when writing Python code is to import any libraries that you'll need. When working with data files, the pandas library offers all sorts of convenient functionality.

In [1]:
import pandas as pd

We will use read_csv to import data from text files. The code below assumes that you have a folder called data and that the file is in that folder. Do you?

What does read_csv do for you? What does csv mean?

In [2]:
# You can do this in one or in two steps.  First, specify where the file is located:
path_to_data = './data/gapminder.tsv'
#if you are not using a data directory:
#path_to_data = 'gapminder.tsv'
# Now import the file
data = pd.read_csv(path_to_data, sep = '\t')
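
To see what the sep argument does, here is a minimal sketch that reads tab-separated text from an in-memory string instead of a file. The two sample rows are copied from the gapminder output shown in this notebook; the names tsv_text, wrong, and demo are made up for illustration.

```python
import io
import pandas as pd

# Two-row sample in tab-separated form (illustrative, not the real file)
tsv_text = "country\tyear\tpop\nAfghanistan\t1952\t8425333\nAlbania\t1952\t1282697\n"

# With the default separator (a comma), each whole line lands in a single column
wrong = pd.read_csv(io.StringIO(tsv_text))

# Telling pandas the separator is a tab splits each line into three columns
demo = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(wrong.shape)  # (2, 1)
print(demo.shape)   # (2, 3)
```

The name csv stands for comma-separated values, which is why a comma is the default separator and a .tsv file needs sep = '\t'.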

What went wrong?

ALWAYS, **ALWAYS** look at the data

In [3]:
data.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


In [4]:
data.tail()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.44996
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623
1703,Zimbabwe,Africa,2007,43.487,12311143,469.709298


In [5]:
data.shape

(1704, 6)

In [6]:
data.size

10224

Why do head and tail have parentheses but shape and size do not?
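
One way to explore this question: shape and size are attributes (values you read directly), while head and tail are methods (functions you call, optionally with arguments). A minimal sketch on a made-up frame called toy:

```python
import pandas as pd

toy = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

print(toy.shape)            # attribute: (3, 2), meaning (rows, columns)
print(toy.size)             # attribute: 6, meaning rows * columns
print(callable(toy.head))   # True: head is a method, so it needs parentheses
print(callable(toy.shape))  # False: shape is just a tuple
print(len(toy.head(2)))     # 2: the argument controls how many rows come back
```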

### Getting a sense of your data

In [7]:
data.dtypes

country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

In [8]:
data.columns

Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')

In [9]:
data.describe() #Notice anything????

Unnamed: 0,year,lifeExp,pop,gdpPercap
count,1704.0,1704.0,1704.0,1704.0
mean,1979.5,59.474439,29601210.0,7215.327081
std,17.26533,12.917107,106157900.0,9857.454543
min,1952.0,23.599,60011.0,241.165876
25%,1965.75,48.198,2793664.0,1202.060309
50%,1979.5,60.7125,7023596.0,3531.846988
75%,1993.25,70.8455,19585220.0,9325.462346
max,2007.0,82.603,1318683000.0,113523.1329


#### Column level data

In [10]:
data['country']

0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object

In [11]:
data['country'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
       'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium',
       'Benin', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.',
       'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Czech Republic',
       'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Ethiopia',
       'Finland', 'France', 'Gabon', 'Gambia', 'Germany', 'Ghana',
       'Greece', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti',
       'Honduras', 'Hong Kong, China', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Korea, Dem. Rep.',
       'Korea, Rep.', 'Kuwait', 'Leba

In [12]:
#how many are there?
len(data['country'].unique())

142
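
As an aside, pandas also has a nunique method that counts the distinct values directly. A small sketch on a made-up Series (the name s and its values are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(['Asia', 'Asia', 'Europe', 'Africa'])

print(len(s.unique()))  # 3
print(s.nunique())      # 3

# The two differ when values are missing: unique() keeps NaN as a value,
# nunique() ignores it by default
s_missing = pd.Series(['Asia', np.nan, 'Europe'])
print(len(s_missing.unique()))  # 3
print(s_missing.nunique())      # 2
```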

In [13]:
data['continent'].value_counts()

continent
Africa      624
Asia        396
Europe      360
Americas    300
Oceania      24
Name: count, dtype: int64

In [14]:
#Find the maximum life expectancy
data['lifeExp'].max()

82.603

In [15]:
#Find the minimum life expectancy
data['lifeExp'].min()

23.599

In [16]:
#Find the mean life expectancy
data['lifeExp'].mean()

59.474439366197174

### Subsetting a data frame

What do we mean by subsetting? We mean creating a new data frame that contains selected elements of the original data frame: only certain columns, only certain rows, or both.

In [17]:
#Subset columns
country_pop = data[['country', 'pop']]

In [18]:
country_pop.head()

Unnamed: 0,country,pop
0,Afghanistan,8425333
1,Afghanistan,9240934
2,Afghanistan,10267083
3,Afghanistan,11537966
4,Afghanistan,13079460


In [19]:
#I would use this if I want all columns but a few
not_continent = data.drop(columns = ['continent'])

In [20]:
data.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
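
Notice that data still has the continent column in the output above: drop returns a new DataFrame and leaves the original alone. A minimal sketch on a made-up one-row frame (toy and no_continent are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({'country': ['Canada'],
                    'continent': ['Americas'],
                    'pop': [33390141]})

# drop returns a new frame; toy itself is not modified
no_continent = toy.drop(columns=['continent'])

print(list(toy.columns))           # ['country', 'continent', 'pop']
print(list(no_continent.columns))  # ['country', 'pop']
```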


In [21]:
#Just 2002
idx = data['year'] == 2002
data_2002 = data.loc[idx]

In [22]:
data_2002.head(2)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
10,Afghanistan,Asia,2002,42.129,25268405,726.734055
22,Albania,Europe,2002,75.651,3508512,4604.211737
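
It can help to look at what idx actually contains. The comparison produces a Series of True/False values, one per row, and .loc keeps the rows marked True. A minimal sketch on a made-up three-row frame (toy is an illustrative name):

```python
import pandas as pd

toy = pd.DataFrame({'country': ['Afghanistan', 'Afghanistan', 'Albania'],
                    'year': [1997, 2002, 2002]})

idx = toy['year'] == 2002   # one True/False per row
print(idx.tolist())         # [False, True, True]
print(idx.sum())            # 2, because True counts as 1

subset = toy.loc[idx]       # keep only the rows marked True
print(subset.index.tolist())  # [1, 2]: the original row labels are kept
```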


In [23]:
#What if I want all of the data from the first year recorded?
#find earliest year
min_year = data['year'].min()
idx = data['year'] == min_year #mark the rows
data_early = data.loc[idx] #select out the rows

In [24]:
data_early.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
12,Albania,Europe,1952,55.23,1282697,1601.056136
24,Algeria,Africa,1952,43.077,9279525,2449.008185
36,Angola,Africa,1952,30.015,4232095,3520.610273
48,Argentina,Americas,1952,62.485,17876956,5911.315053


In [25]:
data['continent'].unique()

array(['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'], dtype=object)

In [26]:
#What if we want all data from the Americas?
idx = data['continent'] == 'Americas'
data_Amer = data.loc[idx]

In [27]:
data_Amer.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
48,Argentina,Americas,1952,62.485,17876956,5911.315053
49,Argentina,Americas,1957,64.399,19610538,6856.856212
50,Argentina,Americas,1962,65.142,21283783,7133.166023
51,Argentina,Americas,1967,65.634,22934225,8052.953021
52,Argentina,Americas,1972,67.065,24779799,9443.038526


In [28]:
data_Amer['continent'].value_counts()

continent
Americas    300
Name: count, dtype: int64

In [29]:
#What if we want all of the data from the Americas in 2007?
idx = data_Amer['year'] == 2007
data_Amer_2007 = data_Amer.loc[idx]

In [30]:
data_Amer_2007.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
59,Argentina,Americas,2007,75.32,40301927,12779.37964
143,Bolivia,Americas,2007,65.554,9119152,3822.137084
179,Brazil,Americas,2007,72.39,190010647,9065.800825
251,Canada,Americas,2007,80.653,33390141,36319.23501
287,Chile,Americas,2007,78.553,16284741,13171.63885


### Boolean logic, and versus & and so forth

In [31]:
Amer_2007_idx = (data['continent'] == 'Americas') & (data['year'] == 2007)
data_Amer_2007 = data.loc[Amer_2007_idx]

In [32]:
data_Amer_2007.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
59,Argentina,Americas,2007,75.32,40301927,12779.37964
143,Bolivia,Americas,2007,65.554,9119152,3822.137084
179,Brazil,Americas,2007,72.39,190010647,9065.800825
251,Canada,Americas,2007,80.653,33390141,36319.23501
287,Chile,Americas,2007,78.553,16284741,13171.63885
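
A short sketch of why the cell above uses & rather than and. The frame toy and its values are made up for illustration. The parentheses around each comparison matter because & binds more tightly than ==.

```python
import pandas as pd

toy = pd.DataFrame({
    'continent': ['Americas', 'Americas', 'Asia'],
    'year': [2002, 2007, 2007],
})

is_amer = toy['continent'] == 'Americas'
is_2007 = toy['year'] == 2007

# & combines the two boolean Series element by element
both = is_amer & is_2007
print(both.tolist())      # [False, True, False]

# | is the element-wise "or", ~ is the element-wise "not"
either = is_amer | is_2007
not_amer = ~is_amer
print(either.tolist())    # [True, True, True]
print(not_amer.tolist())  # [False, False, True]

# The keyword `and` needs a single True/False for the whole Series,
# which pandas refuses to guess, so it raises ValueError
try:
    is_amer and is_2007
except ValueError as err:
    print('ValueError:', err)
```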


### Finding things: .loc versus .iloc

In [33]:
#What if we want to find the highest population in the Americas 2007 data frame?
data_Amer_2007['pop'].max()

301139947

In [34]:
#OK but how do we get the rest of the info?
data_Amer_2007['pop'].idxmax()

1619

In [35]:
#A few ways
data_Amer_2007.loc[1619]

country      United States
continent         Americas
year                  2007
lifeExp             78.242
pop              301139947
gdpPercap      42951.65309
Name: 1619, dtype: object

In [36]:
data_Amer_2007.iloc[1619]

IndexError: single positional indexer is out-of-bounds
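
The error above comes from the difference between labels and positions: .loc looks rows up by index label, while .iloc looks them up by 0-based position. A minimal sketch using three rows copied from the output in this notebook (the name toy is illustrative):

```python
import pandas as pd

# A filtered frame keeps the row labels of the original data
toy = pd.DataFrame(
    {'country': ['Argentina', 'Brazil', 'United States'],
     'pop': [40301927, 190010647, 301139947]},
    index=[59, 179, 1619],
)

label = toy['pop'].idxmax()       # label of the row with the largest pop
print(label)                      # 1619
print(toy.loc[label, 'country'])  # United States (lookup by label)

position = toy.index.get_loc(label)   # convert the label to a position
print(position)                       # 2
print(toy.iloc[position]['country'])  # United States (lookup by position)

# .iloc[1619] fails because the frame has only 3 rows (positions 0 to 2)
try:
    toy.iloc[1619]
except IndexError as err:
    print('IndexError:', err)
```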

In [70]:
data_Amer_2007.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
59,Argentina,Americas,2007,75.32,40301927,12779.37964
143,Bolivia,Americas,2007,65.554,9119152,3822.137084
179,Brazil,Americas,2007,72.39,190010647,9065.800825
251,Canada,Americas,2007,80.653,33390141,36319.23501
287,Chile,Americas,2007,78.553,16284741,13171.63885


In [74]:
Amer_2007_idx = (data['continent'] == 'Americas') & (data['year'] == 2007)
data_Amer_2007 = data.loc[Amer_2007_idx]

In [75]:
data_Amer_2007 = data_Amer_2007.reset_index(drop = True)
data_Amer_2007.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Argentina,Americas,2007,75.32,40301927,12779.37964
1,Bolivia,Americas,2007,65.554,9119152,3822.137084
2,Brazil,Americas,2007,72.39,190010647,9065.800825
3,Canada,Americas,2007,80.653,33390141,36319.23501
4,Chile,Americas,2007,78.553,16284741,13171.63885


In [None]:
data_Amer_2007.head(7)

In [None]:
data_Amer_2007.head()

#### Using inplace = True properly

In [76]:
#Let's get all of the data for Canada, Mexico, and the US
idx_select = data['country'].isin(['Canada','Mexico','United States'])
data_select = data.loc[idx_select]

In [78]:
data_select.head(10)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
240,Canada,Americas,1952,68.75,14785584,11367.16112
241,Canada,Americas,1957,69.96,17010154,12489.95006
242,Canada,Americas,1962,71.3,18985849,13462.48555
243,Canada,Americas,1967,72.13,20819767,16076.58803
244,Canada,Americas,1972,72.88,22284500,18970.57086
245,Canada,Americas,1977,74.21,23796400,22090.88306
246,Canada,Americas,1982,75.76,25201900,22898.79214
247,Canada,Americas,1987,76.86,26549700,26626.51503
248,Canada,Americas,1992,77.95,28523502,26342.88426
249,Canada,Americas,1997,78.61,30305843,28954.92589


In [79]:
data_select.set_index('year', inplace = True)

In [80]:
data_select.head()

year,country,continent,lifeExp,pop,gdpPercap
1952,Canada,Americas,68.75,14785584,11367.16112
1957,Canada,Americas,69.96,17010154,12489.95006
1962,Canada,Americas,71.3,18985849,13462.48555
1967,Canada,Americas,72.13,20819767,16076.58803
1972,Canada,Americas,72.88,22284500,18970.57086


In [85]:
idx_select = data['country'].isin(['Canada','Mexico','United States'])
data_select = data.loc[idx_select]

In [86]:
data_select.set_index('year') #no assignment and no inplace, so the result is discarded
data_select.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
240,Canada,Americas,1952,68.75,14785584,11367.16112
241,Canada,Americas,1957,69.96,17010154,12489.95006
242,Canada,Americas,1962,71.3,18985849,13462.48555
243,Canada,Americas,1967,72.13,20819767,16076.58803
244,Canada,Americas,1972,72.88,22284500,18970.57086


In [87]:
#DON'T DO THIS: use inplace = True OR assignment, never both (with inplace = True the method returns None)
data_select = data_select.set_index('year', inplace = True)

In [88]:
data_select.head()

AttributeError: 'NoneType' object has no attribute 'head'
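
The two working options side by side, as a minimal sketch on a made-up two-row frame (toy, by_year, and result are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({'year': [1952, 1957], 'pop': [14785584, 17010154]})

# Option 1: assignment. set_index returns a new frame and toy is unchanged.
by_year = toy.set_index('year')
print(toy.index.name)      # None
print(by_year.index.name)  # year

# Option 2: inplace=True. toy itself is modified and the call returns None.
result = toy.set_index('year', inplace=True)
print(result)          # None
print(toy.index.name)  # year
```

Combining the two, as in the cell above, stores that None in the variable and loses the data frame.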

### Missing data and the quirks of nan

In [37]:
path_to_data = './data/test_data_v2.csv'
test_data = pd.read_csv(path_to_data)

In [38]:
test_data.head()

Unnamed: 0,Continent,Country,Year,Tourism_expenditure,Arrivals
0,Africa,Dem. Rep. of the Congo,1995,,35.0
1,Africa,Dem. Rep. of the Congo,2005,3.0,61.0
2,Africa,Dem. Rep. of the Congo,2010,11.0,81.0
3,Africa,Dem. Rep. of the Congo,2017,6.0,
4,Africa,Dem. Rep. of the Congo,2018,61.0,


#### Missing data

In [40]:
test_data.isnull().sum()

Continent              0
Country                0
Year                   0
Tourism_expenditure    1
Arrivals               6
dtype: int64

In [42]:
import numpy as np
print(np.nan)
print(type(np.nan))

nan
<class 'float'>


In [43]:
#What if we want to find the observations where arrivals are missing?
idx_miss = test_data['Arrivals'] == np.nan #mark the rows
test_data.loc[idx_miss] #get the rows

Unnamed: 0,Continent,Country,Year,Tourism_expenditure,Arrivals


In [None]:
#Hmmmm
#Let's try something

In [44]:
a = np.nan
b = np.nan
print(a == b)

False


In [47]:
#Use pd.isnull to test for missing values instead of ==
print(pd.isnull(a))
print(pd.isnull(b))
print(pd.isnull(a) == pd.isnull(b))

True
True
True
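
The same behavior carries over to a whole column, which is why the == np.nan comparison earlier returned no rows. A minimal sketch on a made-up Series (s is an illustrative name):

```python
import math
import numpy as np
import pandas as pd

a = np.nan

print(a == a)         # False: NaN is not equal to anything, itself included
print(a != a)         # True
print(math.isnan(a))  # True
print(pd.isna(a))     # True (isna and isnull are the same function)

# Element by element, comparing against NaN is False everywhere
s = pd.Series([35.0, np.nan, 81.0])
print((s == np.nan).tolist())  # [False, False, False]
print(s.isna().tolist())       # [False, True, False]
```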


In [48]:
#These are the rows where Arrivals IS null. How would we get the rows where it is NOT null?
idx_miss = test_data['Arrivals'].isnull() #mark the rows
test_data.loc[idx_miss] #get the rows

Unnamed: 0,Continent,Country,Year,Tourism_expenditure,Arrivals
3,Africa,Dem. Rep. of the Congo,2017,6.0,
4,Africa,Dem. Rep. of the Congo,2018,61.0,
5,Africa,Dem. Rep. of the Congo,2019,100.0,
6,Europe,Denmark,1995,3691.0,
15,Africa,Djibouti,2017,36.0,
16,Africa,Djibouti,2018,57.0,


In [54]:
idx_miss = test_data['Arrivals'].isnull() #mark the rows
test_data.loc[~idx_miss] #get the rows

Unnamed: 0,Continent,Country,Year,Tourism_expenditure,Arrivals
0,Africa,Dem. Rep. of the Congo,1995,,35.0
1,Africa,Dem. Rep. of the Congo,2005,3.0,61.0
2,Africa,Dem. Rep. of the Congo,2010,11.0,81.0
7,Europe,Denmark,2005,5293.0,9178.0
8,Europe,Denmark,2010,5704.0,8744.0
9,Europe,Denmark,2017,8508.0,12426.0
10,Europe,Denmark,2018,9097.0,12749.0
11,Europe,Denmark,2019,8847.0,13285.0
12,Africa,Djibouti,1995,5.0,21.0
13,Africa,Djibouti,2005,7.0,30.0


In [52]:
idx_present = test_data['Arrivals'].notnull() #mark the rows that have a value
test_data.loc[idx_present] #get the rows

Unnamed: 0,Continent,Country,Year,Tourism_expenditure,Arrivals
0,Africa,Dem. Rep. of the Congo,1995,,35.0
1,Africa,Dem. Rep. of the Congo,2005,3.0,61.0
2,Africa,Dem. Rep. of the Congo,2010,11.0,81.0
7,Europe,Denmark,2005,5293.0,9178.0
8,Europe,Denmark,2010,5704.0,8744.0
9,Europe,Denmark,2017,8508.0,12426.0
10,Europe,Denmark,2018,9097.0,12749.0
11,Europe,Denmark,2019,8847.0,13285.0
12,Africa,Djibouti,1995,5.0,21.0
13,Africa,Djibouti,2005,7.0,30.0


In [58]:
#Careful: idx_miss.index holds the labels of ALL rows, so this drops every row
test_data.drop(idx_miss.index)

Unnamed: 0,Continent,Country,Year,Tourism_expenditure,Arrivals
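
The empty result happens because a boolean mask has one entry per row, so its .index lists every row label, not just the flagged ones. A minimal sketch of one way to drop only the flagged rows, on a made-up three-row frame (toy, is_missing, and labels_to_drop are illustrative names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Country': ['Denmark', 'Denmark', 'Djibouti'],
                    'Arrivals': [np.nan, 9178.0, 21.0]})

is_missing = toy['Arrivals'].isnull()

# is_missing.index holds the labels of ALL rows, so this drops everything
print(len(toy.drop(is_missing.index)))  # 0

# Selecting the labels where the mask is True drops only those rows
labels_to_drop = toy.index[is_missing]
kept = toy.drop(labels_to_drop)
print(kept.index.tolist())  # [1, 2]
```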


In [59]:
test_data

Unnamed: 0,Continent,Country,Year,Tourism_expenditure,Arrivals
0,Africa,Dem. Rep. of the Congo,1995,,35.0
1,Africa,Dem. Rep. of the Congo,2005,3.0,61.0
2,Africa,Dem. Rep. of the Congo,2010,11.0,81.0
3,Africa,Dem. Rep. of the Congo,2017,6.0,
4,Africa,Dem. Rep. of the Congo,2018,61.0,
5,Africa,Dem. Rep. of the Congo,2019,100.0,
6,Europe,Denmark,1995,3691.0,
7,Europe,Denmark,2005,5293.0,9178.0
8,Europe,Denmark,2010,5704.0,8744.0
9,Europe,Denmark,2017,8508.0,12426.0


In [61]:
test_data.dropna(subset = ['Arrivals']) #drop any rows where the values in specified columns are missing

Unnamed: 0,Continent,Country,Year,Tourism_expenditure,Arrivals
0,Africa,Dem. Rep. of the Congo,1995,,35.0
1,Africa,Dem. Rep. of the Congo,2005,3.0,61.0
2,Africa,Dem. Rep. of the Congo,2010,11.0,81.0
7,Europe,Denmark,2005,5293.0,9178.0
8,Europe,Denmark,2010,5704.0,8744.0
9,Europe,Denmark,2017,8508.0,12426.0
10,Europe,Denmark,2018,9097.0,12749.0
11,Europe,Denmark,2019,8847.0,13285.0
12,Africa,Djibouti,1995,5.0,21.0
13,Africa,Djibouti,2005,7.0,30.0
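
A minimal sketch comparing the main dropna options on a made-up three-row frame (toy and its values are illustrative). Row 0 has one missing value, row 1 has none, and row 2 is missing both.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Tourism_expenditure': [np.nan, 3.0, np.nan],
    'Arrivals': [35.0, 61.0, np.nan],
})

print(len(toy.dropna()))                     # 1: default how='any' drops rows 0 and 2
print(len(toy.dropna(how='all')))            # 2: only the fully missing row 2 is dropped
print(len(toy.dropna(subset=['Arrivals'])))  # 2: only the Arrivals column is checked
print(len(toy.dropna(thresh=2)))             # 1: keep rows with at least 2 non-missing values

# Like drop, dropna returns a new frame unless inplace=True is used
print(len(toy))  # 3
```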


In [62]:
help(test_data.dropna)

Help on method dropna in module pandas.core.frame:

dropna(*, axis: 'Axis' = 0, how: 'AnyAll | lib.NoDefault' = <no_default>, thresh: 'int | lib.NoDefault' = <no_default>, subset: 'IndexLabel | None' = None, inplace: 'bool' = False, ignore_index: 'bool' = False) -> 'DataFrame | None' method of pandas.core.frame.DataFrame instance
    Remove missing values.
    
    See the :ref:`User Guide <missing_data>` for more on which values are
    considered missing, and how to work with missing data.
    
    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Determine if rows or columns which contain missing values are
        removed.
    
        * 0, or 'index' : Drop rows which contain missing values.
        * 1, or 'columns' : Drop columns which contain missing value.
    
        Only a single axis is allowed.
    
    how : {'any', 'all'}, default 'any'
        Determine if row or column is removed from DataFrame, when we have
        at least one NA