# Pandas

## 1. Pandas Basics

### 1.1. Creating Pandas Series
![](https://github.com/VinitaSilaparasetty/Coursera-Pandas-for-Beginners/blob/master/Media/series.png?raw=true)

In [23]:
import numpy as np
import pandas as pd

In [25]:
# Creating pandas series by passing a list
s = [1, np.nan, ' Pandas Library ']
s1 = pd.Series(s)
s1

0                   1
1                 NaN
2     Pandas Library 
dtype: object

In [27]:
# Creating a pandas Series from a NumPy array (mixed values are all cast to strings, so np.nan becomes the string 'nan')
s2 = np.array([2, np.nan, 'b'])
s2 = pd.Series(s2)
s2


0      2
1    nan
2      b
dtype: object
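
The lowercase `nan` in the output above is worth a second look. A NumPy array holds a single dtype, so mixing numbers and strings turns every element into a string before pandas ever sees it. A minimal sketch of the difference from passing a plain list (variable names here are illustrative only):

```python
import numpy as np
import pandas as pd

values = [2, np.nan, 'b']

from_list = pd.Series(values)             # elements keep their Python types
from_array = pd.Series(np.array(values))  # NumPy casts everything to str first

# The list version still contains a real missing value
missing_in_list = int(from_list.isna().sum())

# The array version holds the text 'nan', which is not treated as missing
missing_in_array = int(from_array.isna().sum())

print(missing_in_list, missing_in_array)
print(from_array.iloc[1] == 'nan')
```

If missing values matter later (for `isnull()` or `dropna()`), building the Series from a list is probably the safer choice for mixed data.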

In [29]:
list1 = ['animal', '2', 'animal']
s3 = pd.Series(list1)
s3

0    animal
1         2
2    animal
dtype: object

In [31]:
list2 = [3, 'c', "Numpy"]
s4 = pd.Series(list2)
s4

0        3
1        c
2    Numpy
dtype: object

In [33]:
# Creating pandas series by passing a dictionary
s5 = pd.Series(
  {'A': 1,
  '3': 'Python',
   })
s5

A         1
3    Python
dtype: object

In [35]:
s6 = pd.Series(
    {'Integer': 3,
    'B': 'Boys',
    })
s6

Integer       3
B          Boys
dtype: object

### 1.2. Creating Dataframes in Pandas
![](https://github.com/VinitaSilaparasetty/Coursera-Pandas-for-Beginners/blob/master/Media/dataframe.gif?raw=true)

In [37]:
# Passing a numpy array to create a dataframe
df = pd.DataFrame(np.random.randn(6,4)) # randn draws a 6x4 matrix of samples from the standard normal distribution
print(df.iloc[0, 0]) # first row, first column by position (df.ix[] was deprecated and later removed)
print(df.loc[2, 1]) # row labelled 2, column labelled 1; would be df.loc[2, 'meow'] if the column were named 'meow'
df

0.3819838522071795
0.9817373405661077


| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.381984 | -0.582466 | 0.705956 | -0.271332 |
| 1 | -0.148133 | -0.267256 | -0.511798 | -0.713045 |
| 2 | -1.74121 | 0.981737 | 0.399071 | 0.116008 |
| 3 | 1.000557 | 1.349506 | 1.07283 | -0.18705 |
| 4 | 0.000784 | -1.944793 | -0.586982 | 0.295455 |
| 5 | -0.588006 | 0.055948 | 1.377619 | 1.897821 |


* `loc` is label-based: with two arguments, you select data by row label and column label. An integer passed to `loc` is treated as a label, not a position; the two only coincide when the index is the default 0, 1, 2, ...
* `iloc` is integer-position-based: you use row numbers and column numbers to get rows or columns at particular positions in the data frame.
* `ix` looked for a label first and fell back to an integer position if it found no label, so it accepted either row/column numbers or row/column labels.

* `ix` was deprecated in pandas 0.20.0 and removed in pandas 1.0; use `loc` or `iloc` instead.
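
The difference between the two accessors only shows up when labels and positions disagree. A small sketch, using a made-up frame whose integer row labels are deliberately out of order:

```python
import pandas as pd

# Illustrative frame: row labels 2, 0, 1 do not match row positions 0, 1, 2
frame = pd.DataFrame(
    {'meow': [10, 20, 30], 'woof': [1.5, 2.5, 3.5]},
    index=[2, 0, 1],
)

by_label = frame.loc[2, 'meow']   # row labelled 2, which is the first row
by_position = frame.iloc[2, 0]    # third row, first column

print(by_label, by_position)

# Label slices include the end label; positional slices exclude the end position
label_slice = frame.loc[2:0]      # rows labelled 2 and 0
position_slice = frame.iloc[0:1]  # first row only
print(len(label_slice), len(position_slice))
```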

In [39]:
df1 = pd.DataFrame(np.random.randn(3,3))
df1

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -1.476632 | 0.013713 | 0.83867 |
| 1 | 0.072095 | 0.429955 | -1.110153 |
| 2 | 0.138181 | -0.535844 | 0.752361 |


In [41]:
# Passing a dictionary to create a pandas dataframe
df2 = pd.DataFrame({
    'A': 1,
    'number':np.array([6]*3,dtype='int32'),
})
df2

| | A | number |
|---|---|---|
| 0 | 1 | 6 |
| 1 | 1 | 6 |
| 2 | 1 | 6 |


In [43]:
df3 = pd.DataFrame({
    'E': np.array([4] * 5, dtype='int32'),
    'Day': 2,
})
df3

| | E | Day |
|---|---|---|
| 0 | 4 | 2 |
| 1 | 4 | 2 |
| 2 | 4 | 2 |
| 3 | 4 | 2 |
| 4 | 4 | 2 |


### 1.3. Importing Files

In [45]:
%%bash
ls -R
git clone https://github.com/VinitaSilaparasetty/Coursera-Pandas-for-Beginners/ || (cd Coursera-Pandas-for-Beginners ; git pull)


.:
Coursera-Pandas-for-Beginners
example_1.csv
pandas_notes.ipynb

./Coursera-Pandas-for-Beginners:
LICENSE
Media
practicedata.csv
practicedata.json
README.md

./Coursera-Pandas-for-Beginners/Media:
dataframe.gif
series.png
social media buttons.png
Already up to date.
fatal: destination path 'Coursera-Pandas-for-Beginners' already exists and is not an empty directory.


In [47]:
# CSV Files
df4 = pd.read_csv('Coursera-Pandas-for-Beginners/practicedata.csv')
df4.head(3) # Shows only the first three rows

| | A | B | C | D |
|---|---|---|---|---|
| 0 | NaN | -1.604969 | -0.106263 | -1.002924 |
| 1 | -0.404667 | 0.458565 | -1.174912 | NaN |
| 2 | 0.440231 | 1.71348 | 0.162473 | -1.632132 |


In [49]:
# Importing json
df5 = pd.read_json('Coursera-Pandas-for-Beginners/practicedata.json')
df5

| | A | B | C | D |
|---|---|---|---|---|
| 0 | NaN | -1.604969 | -0.106263 | -1.002924 |
| 1 | -0.404667 | 0.458565 | -1.174912 | NaN |
| 2 | 0.440231 | 1.71348 | 0.162473 | -1.632132 |
| 3 | 0.08105 | 0.639851 | 0.844037 | 1.463154 |
| 4 | -1.013616 | -0.224553 | 1.786915 | 1.041241 |
| 5 | -1.013616 | -0.224553 | 1.786915 | 1.041241 |
| 6 | -0.488965 | 0.946528 | 0.829525 | -0.529912 |
| 7 | -0.488965 | 0.946528 | 0.829525 | -0.529912 |


In [51]:
# Importing from URL
df6 = pd.read_csv('https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv')
df6

| | John | Doe | 120 jefferson st. | Riverside | NJ | 08075 |
|---|---|---|---|---|---|---|
| 0 | Jack | McGinnis | 220 hobo Av. | Phila | PA | 9119 |
| 1 | John "Da Man" | Repici | 120 Jefferson St. | Riverside | NJ | 8075 |
| 2 | Stephen | Tyler | 7452 Terrace "At the Plaza" road | SomeTown | SD | 91234 |
| 3 | NaN | Blankman | NaN | SomeTown | SD | 298 |
| 4 | Joan "the bone", Anne | Jet | 9th, at Terrace plc | Desert City | CO | 123 |
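
In the output above, the first record (John, Doe, ...) has ended up as the column names, which suggests the file has no header row. The ZIP codes have also lost their leading zeros because they were parsed as integers. A possible fix is sketched below on a two-line in-memory copy of the data; the column names are invented for this example and are not part of the file:

```python
import io
import pandas as pd

# Two records shaped like the file above, with no header row
raw = io.StringIO(
    'John,Doe,120 jefferson st.,Riverside,NJ,08075\n'
    'Jack,McGinnis,220 hobo Av.,Phila,PA,09119\n'
)

# Hypothetical column names, chosen only for this sketch
columns = ['first', 'last', 'street', 'city', 'state', 'zip']

addresses = pd.read_csv(
    raw,
    header=None,         # do not treat the first record as a header
    names=columns,
    dtype={'zip': str},  # keep ZIP codes as text so leading zeros survive
)
print(addresses)
```

The same `header`, `names` and `dtype` arguments should work when the first argument is a URL or file path instead of a `StringIO` object.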


In [53]:
# Exporting data
df6.to_csv('example_1.csv') # to_csv creates the file but not missing directories; the row index is written too unless index=False
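
If the export should go into a folder that does not exist yet, the folder has to be created first. A minimal sketch, assuming a temporary directory and a made-up `exports` subfolder so nothing is left behind on disk:

```python
import tempfile
from pathlib import Path

import pandas as pd

table = pd.DataFrame({'name': ['Jack', 'Stephen'], 'state': ['PA', 'SD']})

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'exports' / 'example_1.csv'

    # to_csv does not create missing folders, so create them first
    target.parent.mkdir(parents=True, exist_ok=True)

    # index=False leaves out the row index, so no 'Unnamed: 0' column
    # appears when the file is read back
    table.to_csv(target, index=False)

    round_trip = pd.read_csv(target)

print(round_trip.columns.tolist())
```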

In [55]:
%%bash
ls

Coursera-Pandas-for-Beginners
example_1.csv
pandas_notes.ipynb


### 1.4. Summarizing Data

In [57]:
# Quick summary of the data
df5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       7 non-null      float64
 1   B       8 non-null      float64
 2   C       8 non-null      float64
 3   D       7 non-null      float64
dtypes: float64(4)
memory usage: 384.0 bytes


In [59]:
# Summary statistics of the data
df5.describe()

| | A | B | C | D |
|---|---|---|---|---|
| count | 7.0 | 8.0 | 8.0 | 7.0 |
| mean | -0.41265 | 0.33136 | 0.619777 | -0.021321 |
| std | 0.533101 | 1.010369 | 0.98705 | 1.192849 |
| min | -1.013616 | -1.604969 | -1.174912 | -1.632132 |
| 25% | -0.751291 | -0.224553 | 0.095289 | -0.766418 |
| 50% | -0.488965 | 0.549208 | 0.829525 | -0.529912 |
| 75% | -0.161808 | 0.946528 | 1.079756 | 1.041241 |
| max | 0.440231 | 1.71348 | 1.786915 | 1.463154 |


In [61]:
# View columns
df5.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [63]:
# View datatypes
df5.dtypes

A    float64
B    float64
C    float64
D    float64
dtype: object

In [65]:
# Detect duplicate rows
df5.duplicated()

0    False
1    False
2    False
3    False
4    False
5     True
6    False
7     True
dtype: bool

In [67]:
# Drop duplicate rows
df5.drop_duplicates() # Returns a new DataFrame without rows 5 and 7; df5 itself is unchanged

| | A | B | C | D |
|---|---|---|---|---|
| 0 | NaN | -1.604969 | -0.106263 | -1.002924 |
| 1 | -0.404667 | 0.458565 | -1.174912 | NaN |
| 2 | 0.440231 | 1.71348 | 0.162473 | -1.632132 |
| 3 | 0.08105 | 0.639851 | 0.844037 | 1.463154 |
| 4 | -1.013616 | -0.224553 | 1.786915 | 1.041241 |
| 6 | -0.488965 | 0.946528 | 0.829525 | -0.529912 |
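
`drop_duplicates()` hands back a new DataFrame and leaves the original alone, so the result has to be assigned if it should be kept. A short sketch on made-up data, also showing the `keep` option:

```python
import pandas as pd

scores = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': [9, 8, 8, 7, 7]})

deduped = scores.drop_duplicates()               # keeps the first of each repeated row
no_repeats = scores.drop_duplicates(keep=False)  # drops every row that has a repeat

print(len(scores), len(deduped), len(no_repeats))

# Assign the result back to make the change stick, and renumber the rows
scores = scores.drop_duplicates().reset_index(drop=True)
print(scores.index.tolist())
```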


In [69]:
# Detect missing values
print(df5.isnull().sum()) # Columns A and D each have 1 null value

A    1
B    0
C    0
D    1
dtype: int64


In [71]:
# Drop missing values
df5.dropna()

| | A | B | C | D |
|---|---|---|---|---|
| 2 | 0.440231 | 1.71348 | 0.162473 | -1.632132 |
| 3 | 0.08105 | 0.639851 | 0.844037 | 1.463154 |
| 4 | -1.013616 | -0.224553 | 1.786915 | 1.041241 |
| 5 | -1.013616 | -0.224553 | 1.786915 | 1.041241 |
| 6 | -0.488965 | 0.946528 | 0.829525 | -0.529912 |
| 7 | -0.488965 | 0.946528 | 0.829525 | -0.529912 |
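
By default `dropna()` removes every row that has at least one missing value, which is why rows 0 and 1 are gone above. It can be narrowed down; a small sketch on made-up data:

```python
import numpy as np
import pandas as pd

readings = pd.DataFrame({
    'A': [np.nan, 1.0, 2.0],
    'B': [0.5, 0.6, 0.7],
    'D': [1.0, np.nan, 3.0],
})

any_missing = readings.dropna()         # drop rows with at least one NaN
only_a = readings.dropna(subset=['A'])  # only check column 'A'
drop_columns = readings.dropna(axis=1)  # drop columns that contain NaN instead

print(any_missing.index.tolist())
print(only_a.index.tolist())
print(drop_columns.columns.tolist())
```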



### 1.5. Numeric Operations

In [73]:
df8 = pd.DataFrame(np.random.randn(6,4))
df8

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.624748 | -1.313462 | 0.780397 | 0.191427 |
| 1 | 0.118784 | 0.043062 | -0.077965 | -0.001495 |
| 2 | 1.089557 | 0.886608 | -0.016494 | 2.416067 |
| 3 | -0.103098 | 0.554888 | 0.310791 | -0.900836 |
| 4 | 0.693979 | 0.355918 | 0.301785 | 0.990279 |
| 5 | -1.536101 | 0.476351 | 1.03822 | 0.300871 |


In [75]:
# Calculate the mean
df8.mean()


0    0.147978
1    0.167228
2    0.389456
3    0.499386
dtype: float64

In [77]:
# Calculate the cumulative sum
df8.apply(np.cumsum)

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.624748 | -1.313462 | 0.780397 | 0.191427 |
| 1 | 0.743532 | -1.270401 | 0.702432 | 0.189932 |
| 2 | 1.833089 | -0.383792 | 0.685938 | 2.605999 |
| 3 | 1.72999 | 0.171096 | 0.996729 | 1.705163 |
| 4 | 2.423969 | 0.527014 | 1.298514 | 2.695442 |
| 5 | 0.887868 | 1.003365 | 2.336734 | 2.996313 |
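
`apply(np.cumsum)` works column by column. pandas also has a built-in `cumsum()` that gives the same result and takes an `axis` argument; the same goes for `mean()`, `max()` and `min()`. A quick sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

values = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [10.0, 20.0, 30.0]})

via_apply = values.apply(np.cumsum)  # the form used above
built_in = values.cumsum()           # same result, column by column
across = values.cumsum(axis=1)       # running total across each row

row_means = values.mean(axis=1)      # one mean per row instead of per column

print(built_in)
print(across)
print(row_means.tolist())
```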


In [79]:
# Find the max value
df8.max()


0    1.089557
1    0.886608
2    1.038220
3    2.416067
dtype: float64

In [80]:
# Find the min value
df8.min()

0   -1.536101
1   -1.313462
2   -0.077965
3   -0.900836
dtype: float64

### 1.6. String Manipulation

In [81]:
s9 = np.array(['animal', 'bird', ' Pandas Library '])
s9 = pd.Series(s9)
s9

0              animal
1                bird
2     Pandas Library 
dtype: object

In [82]:
# Convert to lowercase
s9.str.lower()

0              animal
1                bird
2     pandas library 
dtype: object

In [83]:
# Swap capitalization
s9.str.swapcase()

0              ANIMAL
1                BIRD
2     pANDAS lIBRARY 
dtype: object

In [84]:
# Find length of string
s9.str.len()

0     6
1     4
2    16
dtype: int64

In [85]:
# Take the cumulative sum of the lengths, because why not
s9.str.len().cumsum()

0     6
1    10
2    26
dtype: int64

In [86]:
# Split string
s9.str.split()

0             [animal]
1               [bird]
2    [Pandas, Library]
dtype: object

In [87]:
# Detect unique values
s9.unique()

array(['animal', 'bird', ' Pandas Library '], dtype=object)

In [88]:
# Repeat string

repeat_list = [2, 3, 2]
s9.str.repeat(repeat_list)

0                        animalanimal
1                        birdbirdbird
2     Pandas Library  Pandas Library 
dtype: object
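
The third string carries a leading and a trailing space, which is why its length came out as 16 earlier. A short sketch of removing that padding with `str.strip()` and chaining further string methods:

```python
import pandas as pd

names = pd.Series(['animal', 'bird', ' Pandas Library '])

stripped = names.str.strip()           # remove leading and trailing whitespace
padded_lengths = names.str.len().tolist()
clean_lengths = stripped.str.len().tolist()
print(padded_lengths, clean_lengths)

# String methods return a new Series, so they can be chained
slug = stripped.str.lower().str.replace(' ', '_')
print(slug.tolist())
```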

## 2. Pandas Advanced 
### 2.0. About the dataset used below:

The database includes data from Botswana, Burkina Faso, Cameroon, Ethiopia, The Gambia, Ghana, Kenya, Lesotho, Liberia, Madagascar, Malawi, Mauritius, Nigeria, Sudan, Swaziland, Zaire, Zambia, and Zimbabwe.

If a donor gives aid for a project that the recipient government would have undertaken anyway, then the aid is financing some expenditure other than the intended project. The notion that aid in this sense may be "fungible," while long recognized, has recently been receiving some empirical support.

Modifications:

* Three entries in the column 'popn' have been deleted at random, in order to create missing values for teaching purposes.
* Only the first 302 rows of the complete dataset are present in this subset.

See the [variable description file](https://github.com/VinitaSilaparasetty/Coursera-Intermediate-Pandas/blob/master/variable%20description-%20what%20does%20aid%20to%20africa%20finance.pdf).

In [89]:
# Loading new data
df = pd.read_csv("https://raw.githubusercontent.com/VinitaSilaparasetty/Coursera-Intermediate-Pandas/master/What_does_aid_to_Africa_finance_1.csv")
df.head(10) # Ensure the data has loaded correctly.


Unnamed: 0,countryc,year,agrgdp,popn,infmort,schprim,schsec,grtdsbp,grlndsbp,aiddsbp,...,dcurexpp,dcapexpp,dprirepp,dcnlnagp,dcnlnenp,dcnlninp,dcnlntacp,dcnlnedup,dcnlnhthp,dcnlnothp
0,Burkina Faso,1970,35.44188862,5633000.0,141.3999939,13.0,1.0,13.3182802200317,1.02303504943848,14.3413200378418,...,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308
1,Burkina Faso,1970,35.44188862,5633000.0,141.3999939,13.0,1.0,13.3182802200317,1.02303504943848,14.3413200378418,...,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308
2,Burkina Faso,1971,36.16739069,5740700.0,139.1999969,13.6,1.2,16.7043991088867,0.655763506889343,17.3601703643799,...,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308
3,Burkina Faso,1972,37.51058767,5848380.0,137.0,14.2,1.4,20.9176502227783,2.97720909118652,23.8948593139648,...,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308
4,Burkina Faso,1973,34.83428571,5958700.0,135.0,14.8,1.6,25.9791507720947,3.87817406654358,29.8573207855225,...,1.79769313486232e+308,1.79769313486232e+308,-4.26292991638184,0.290098994970322,0.0578910000622272,2.53049802780151,-0.38238000869751,0,0,-0.11642000079155
5,Burkina Faso,1973,34.83428571,5958700.0,135.0,14.8,1.6,25.9791507720947,3.87817406654358,29.8573207855225,...,1.79769313486232e+308,1.79769313486232e+308,-4.26292991638184,0.290098994970322,0.0578910000622272,2.53049802780151,-0.38238000869751,0,0,-0.11642000079155
6,Burkina Faso,1974,36.48014145,6075700.0,133.0,15.4,1.8,38.6305809020996,6.66203498840332,45.2926216125488,...,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308
7,Burkina Faso,1975,34.27100776,6202000.0,131.0,16.0,2.0,30.4486293792725,7.36860179901123,37.8172302246094,...,5.94059085845947,0.0831290036439896,0.118987999856472,0.262962996959686,0.517924010753632,-2.78498005867004,0.806988000869751,0.0681129992008209,0,2.65820407867432
8,Burkina Faso,1976,34.80431988,,129.0,15.0,2.0,24.3181304931641,8.15182018280029,32.4699592590332,...,0.0346110016107559,2.13481593132019,-0.0604999996721744,0.739816009998322,-0.850790023803711,-0.320169985294342,0.252671003341675,0.0882859975099564,0,0.984821021556854
9,Burkina Faso,1977,34.31152713,6486870.0,127.0,16.0,2.0,27.812780380249,11.5194902420044,39.3322715759277,...,3.9152410030365,1.05086898803711,-0.137299999594688,-0.195250004529953,0,2.32170295715332,1.82917904853821,0.17212900519371,1.17611503601074,0.820497989654541
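
Many cells in this preview hold `1.79769313486232e+308`, which is close to the largest value a 64-bit float can represent. These notes do not say what it stands for, but a value like that looks more like a "no data" placeholder than a measurement. Assuming that reading (it is an assumption, so check the variable description file first), one way to turn such placeholders into `NaN` is sketched below on a toy column. The `1e300` cut-off is an arbitrary choice for this example:

```python
import pandas as pd

# Toy column mimicking the preview: real values mixed with the huge placeholder
raw = pd.Series(['5.94', '1.79769313486232e+308', '0.03'])

numeric = pd.to_numeric(raw, errors='coerce')  # text to float, unparseable text to NaN

# Assumption: anything this large is a placeholder, not data
PLACEHOLDER_THRESHOLD = 1e300
cleaned = numeric.mask(numeric.abs() > PLACEHOLDER_THRESHOLD)

print(int(cleaned.isna().sum()))
print(cleaned.mean())
```

Left in place, values of this size would dominate any mean or sum computed over the affected columns.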


### 2.1. Splitting Data

In [90]:
# Create a copy of the dataset
df_new = df.copy()

# First subset
df1 = df_new.sample(frac=0.25, random_state=0)

# Drop values assigned to df1
df_new = df_new.drop(df1.index) 

# Second subset
df2 = df_new.sample(frac=0.25, random_state=0)

df_new = df_new.drop(df2.index)

# Third subset
df3 = df_new.sample(frac=0.25, random_state=0)

# The remaining rows of df_new can now be assigned directly to df4
df4 = df_new.drop(df3.index)
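
Note that each `sample(frac=0.25)` call above takes 25% of whatever is left, so the first three subsets get smaller each time and the last one receives the remainder. The sketch below reproduces that on a stand-in frame with 302 rows and, as one possible alternative, shuffles once and cuts the result into four nearly equal parts:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'value': np.arange(302)})  # stand-in with 302 rows

# Repeated 25% samples: each draw is 25% of what is left
remaining = data.copy()
parts = []
for _ in range(3):
    part = remaining.sample(frac=0.25, random_state=0)
    remaining = remaining.drop(part.index)
    parts.append(part)
parts.append(remaining)
print([len(p) for p in parts])

# Alternative: shuffle once, then cut into four nearly equal pieces
shuffled = data.sample(frac=1, random_state=0)
cuts = np.array_split(np.arange(len(shuffled)), 4)
equal_parts = [shuffled.iloc[c] for c in cuts]
print([len(p) for p in equal_parts])
```

Either way the subsets do not overlap, because rows are removed (or sliced off) once they have been assigned.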



### 2.2. Handle Missing Values

In [91]:
# Detect missing values
print(df3.isnull().sum())

countryc     0
year         0
agrgdp       0
popn         1
infmort      0
schprim      0
schsec       0
grtdsbp      0
grlndsbp     0
aiddsbp      0
totexpp      0
agexpp       0
enexpp       0
indexpp      0
tacexpp      0
eduexpp      0
hthexpp      0
prirepp      0
curexpp      0
capexpp      0
gdnpp        0
d0           0
cnlnagp      0
cnlnenp      0
cnlninp      0
cnlntacp     0
cnlnedup     0
cnlnhthp     0
cnlnothp     0
dgrtdsbp     0
dgrlndsbp    0
daiddsbp     0
dtotexpp     0
dagexpp      0
denexpp      0
dindexpp     0
dtacexpp     0
deduexpp     0
dhthexpp     0
dothexpp     0
dcurexpp     0
dcapexpp     0
dprirepp     0
dcnlnagp     0
dcnlnenp     0
dcnlninp     0
dcnlntacp    0
dcnlnedup    0
dcnlnhthp    0
dcnlnothp    0
dtype: int64


![](https://github.com/VinitaSilaparasetty/Coursera-Intermediate-Pandas/blob/master/media/imputation.gif?raw=true)

In [92]:
# Impute missing values
# Imputation is a method of predicting missing values based on observed values
# Mean imputation (see pic above)

df3['popn']

# Ctd...


237     6901230.0
244     8251580.0
299     5947940.0
87     34759760.0
91     38772368.0
260    13248750.0
14      7308230.0
157     1113000.0
207     2596430.0
160    12329860.0
255    11301740.0
9       6486870.0
138      566690.0
181    25347400.0
88     34759760.0
33       755100.0
102    54790000.0
164    14255000.0
100    51180000.0
104    54890000.0
275      993850.0
4       5958700.0
171    18597660.0
289           NaN
180    24679850.0
61      7920570.0
110     9621420.0
281     1040330.0
75     11825390.0
43      1075000.0
54      6506000.0
213     1131220.0
71     10535260.0
16      7681280.0
203     2391540.0
153      964690.0
161    12778060.0
273      966000.0
279     1024540.0
40       966940.0
50      1353060.0
193     1819140.0
Name: popn, dtype: float64

In [93]:
df3['popn'].mean()

12570541.658536585

In [94]:
df3.isnull()

Unnamed: 0,countryc,year,agrgdp,popn,infmort,schprim,schsec,grtdsbp,grlndsbp,aiddsbp,...,dcurexpp,dcapexpp,dprirepp,dcnlnagp,dcnlnenp,dcnlninp,dcnlntacp,dcnlnedup,dcnlnhthp,dcnlnothp
237,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
244,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
299,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
87,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
91,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
260,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
14,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
157,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
207,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
160,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [95]:
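# Assign back; fillna(..., inplace=True) on a column selection is deprecated in recent pandas and may not update df3
df3['popn'] = df3['popn'].fillna(df3['popn'].mean())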
print(df3.isnull().sum())

countryc     0
year         0
agrgdp       0
popn         0
infmort      0
schprim      0
schsec       0
grtdsbp      0
grlndsbp     0
aiddsbp      0
totexpp      0
agexpp       0
enexpp       0
indexpp      0
tacexpp      0
eduexpp      0
hthexpp      0
prirepp      0
curexpp      0
capexpp      0
gdnpp        0
d0           0
cnlnagp      0
cnlnenp      0
cnlninp      0
cnlntacp     0
cnlnedup     0
cnlnhthp     0
cnlnothp     0
dgrtdsbp     0
dgrlndsbp    0
daiddsbp     0
dtotexpp     0
dagexpp      0
denexpp      0
dindexpp     0
dtacexpp     0
deduexpp     0
dhthexpp     0
dothexpp     0
dcurexpp     0
dcapexpp     0
dprirepp     0
dcnlnagp     0
dcnlnenp     0
dcnlninp     0
dcnlntacp    0
dcnlnedup    0
dcnlnhthp    0
dcnlnothp    0
dtype: int64
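
Since `popn` is a country's population, the overall mean mixes very different countries together. One option that may be worth considering is to fill each gap with the mean of the same country. The sketch below uses made-up numbers, not the real dataset, and compares the two:

```python
import numpy as np
import pandas as pd

# Made-up numbers for illustration only
people = pd.DataFrame({
    'countryc': ['X', 'X', 'X', 'Y', 'Y'],
    'popn': [100.0, np.nan, 120.0, 1000.0, 1200.0],
})

# Overall mean imputation, written as an assignment rather than inplace=True
overall = people['popn'].fillna(people['popn'].mean())

# Per-country mean imputation
country_mean = people.groupby('countryc')['popn'].transform('mean')
per_country = people['popn'].fillna(country_mean)

print(overall.iloc[1], per_country.iloc[1])
```

The overall mean fills the gap for country X with a value pulled upwards by country Y, while the per-country mean stays within X's own range.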


![](https://github.com/VinitaSilaparasetty/Coursera-Intermediate-Pandas/blob/master/media/interpolation.gif?raw=true)

In [96]:
# Interpolating missing values (see pic above)

# Detect missing values
print(df1.isnull().sum())
# Ctd...

countryc     0
year         0
agrgdp       0
popn         1
infmort      0
schprim      0
schsec       0
grtdsbp      0
grlndsbp     0
aiddsbp      0
totexpp      0
agexpp       0
enexpp       0
indexpp      0
tacexpp      0
eduexpp      0
hthexpp      0
prirepp      0
curexpp      0
capexpp      0
gdnpp        0
d0           0
cnlnagp      0
cnlnenp      0
cnlninp      0
cnlntacp     0
cnlnedup     0
cnlnhthp     0
cnlnothp     0
dgrtdsbp     0
dgrlndsbp    0
daiddsbp     0
dtotexpp     0
dagexpp      0
denexpp      0
dindexpp     0
dtacexpp     0
deduexpp     0
dhthexpp     0
dothexpp     0
dcurexpp     0
dcapexpp     0
dprirepp     0
dcnlnagp     0
dcnlnenp     0
dcnlninp     0
dcnlntacp    0
dcnlnedup    0
dcnlnhthp    0
dcnlnothp    0
dtype: int64


In [97]:
# interpolate() returns the filled Series, so assign it back to the column
df1['popn'] = df1['popn'].interpolate()

print(df1.isnull().sum())

countryc     0
year         0
agrgdp       0
popn         0
infmort      0
schprim      0
schsec       0
grtdsbp      0
grlndsbp     0
aiddsbp      0
totexpp      0
agexpp       0
enexpp       0
indexpp      0
tacexpp      0
eduexpp      0
hthexpp      0
prirepp      0
curexpp      0
capexpp      0
gdnpp        0
d0           0
cnlnagp      0
cnlnenp      0
cnlninp      0
cnlntacp     0
cnlnedup     0
cnlnhthp     0
cnlnothp     0
dgrtdsbp     0
dgrlndsbp    0
daiddsbp     0
dtotexpp     0
dagexpp      0
denexpp      0
dindexpp     0
dtacexpp     0
deduexpp     0
dhthexpp     0
dothexpp     0
dcurexpp     0
dcapexpp     0
dprirepp     0
dcnlnagp     0
dcnlnenp     0
dcnlninp     0
dcnlntacp    0
dcnlnedup    0
dcnlnhthp    0
dcnlnothp    0
dtype: int64


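* Interpolation estimates a missing value from its neighbouring rows, so it suits data that changes roughly linearly along an ordered axis (for example, year by year). Otherwise use imputation, such as filling with the mean. The best choice depends on the business objectives (no free lunch).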
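
A small worked example of the difference, on made-up yearly values. It also shows that interpolation depends on row order, which matters here because `df1` comes from `sample()` and its rows are therefore in random order:

```python
import numpy as np
import pandas as pd

# Made-up yearly values with one gap
yearly = pd.Series([10.0, np.nan, 30.0, 40.0], index=[1970, 1971, 1972, 1973])

interpolated = yearly.interpolate().loc[1971]       # midpoint of the two neighbours
imputed = yearly.fillna(yearly.mean()).loc[1971]    # mean of all observed values
print(interpolated, imputed)

# Interpolation uses neighbouring rows, so row order matters
shuffled = yearly.loc[[1973, 1971, 1970, 1972]]
in_sample_order = shuffled.interpolate().loc[1971]             # between 40 and 10
in_year_order = shuffled.sort_index().interpolate().loc[1971]  # between 10 and 30
print(in_sample_order, in_year_order)
```

Sorting by the ordered axis before interpolating is probably what one wants when the rows have been shuffled.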

#### 2.2.1 Example missing data handling challenge
* Assume you're studying the effects of infant mortality rate on each of the variables in the data frame. Detect missing values in df2 and decide on the best method to handle them.

In [98]:
df2.isnull().sum()

countryc     0
year         0
agrgdp       0
popn         1
infmort      0
schprim      0
schsec       0
grtdsbp      0
grlndsbp     0
aiddsbp      0
totexpp      0
agexpp       0
enexpp       0
indexpp      0
tacexpp      0
eduexpp      0
hthexpp      0
prirepp      0
curexpp      0
capexpp      0
gdnpp        0
d0           0
cnlnagp      0
cnlnenp      0
cnlninp      0
cnlntacp     0
cnlnedup     0
cnlnhthp     0
cnlnothp     0
dgrtdsbp     0
dgrlndsbp    0
daiddsbp     0
dtotexpp     0
dagexpp      0
denexpp      0
dindexpp     0
dtacexpp     0
deduexpp     0
dhthexpp     0
dothexpp     0
dcurexpp     0
dcapexpp     0
dprirepp     0
dcnlnagp     0
dcnlnenp     0
dcnlninp     0
dcnlntacp    0
dcnlnedup    0
dcnlnhthp    0
dcnlnothp    0
dtype: int64

In [99]:
df2[df2.isnull().any(axis=1)]

Unnamed: 0,countryc,year,agrgdp,popn,infmort,schprim,schsec,grtdsbp,grlndsbp,aiddsbp,...,dcurexpp,dcapexpp,dprirepp,dcnlnagp,dcnlnenp,dcnlninp,dcnlntacp,dcnlnedup,dcnlnhthp,dcnlnothp
48,Botswana,1990,5.456680968,,55.8,114,42,125.423896789551,14.6862802505493,140.110198974609,...,147.043899536133,57.2380714416504,-0.408820003271103,-0.127890005707741,2.52362895011902,-8.80912017822266,-8.8169002532959,-11.6063995361328,-0.0971599966287613,6.63489723205566


In [100]:
# Suggested reasoning: the infant mortality rate has a direct impact on population size,
# so population depends on infant mortality rate, hence interpolation.
# Caveat: df2['infmort'] has no gaps, so fillna(df2['infmort'].interpolate())
# would copy the infant mortality rate (55.8 for row 48) into popn.
# Interpolating popn itself avoids that, although it still ignores 'infmort'.

df2['popn'] = df2['popn'].interpolate()
df2

# Interpolation here means estimating a missing value from neighbouring values.

Unnamed: 0,countryc,year,agrgdp,popn,infmort,schprim,schsec,grtdsbp,grlndsbp,aiddsbp,...,dcurexpp,dcapexpp,dprirepp,dcnlnagp,dcnlnenp,dcnlninp,dcnlntacp,dcnlnedup,dcnlnhthp,dcnlnothp
168,Kenya,1980,32.59223808,16560000.0,72.40000153,115,20,26.8336296081543,17.7571392059326,44.5907592773438,...,-1.90575003623962,4.4354100227356,-0.64258998632431,-0.43707999587059,-2.52485990524292,-0.424710005521774,-3.0382399559021,-0.0560100004076958,0,7.8144268989563
109,Ghana,1973,48.97463727,9388140.0,106.0,68.2,27.8,7.54381990432739,7.85923004150391,15.4030504226685,...,-13.4622001647949,-5.45116996765137,-0.186719998717308,-0.00694000022485852,-3.54000997543335,0.0770640000700951,-0.0829199999570846,0,0,-0.743390023708344
204,Liberia,1990,1.79769313486232e+308,2435000.0,176.8,1.79769313486232e+308,1.79769313486232e+308,38.6235008239746,16.2881603240967,54.9116592407227,...,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308
247,Madagascar,1981,33.07584521,8951460.0,134.0,1.79769313486232e+308,1.79769313486232e+308,17.8668098449707,25.6996097564697,43.5664291381836,...,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308
99,Ethiopia,1989,48.50468489,49337300.0,126.8,34,14,15.0112895965576,3.97879600524902,18.9900798797607,...,0.919135987758636,2.5445671081543,0.135925993323326,-0.0909200012683868,-0.922819972038269,0.01554000005126,0.00121000001672655,-0.114739999175072,0.0564300008118153,-3.32821989059448
228,Lesotho,1988,24.40058125,1689570.0,88.2,107,24.2,67.3578262329102,17.261739730835,84.6195831298828,...,8.38925743103027,-1.36620998382568,-0.754760026931763,-0.180490002036095,2.19999504089355,-0.662069976329804,-0.273030012845993,-1.6026200056076,0.604721009731293,0.780026018619537
172,Kenya,1984,33.91489131,19302100.0,64.8,102.2,20.8,20.7450504302978,10.9030799865723,31.6481304168701,...,-2.49663996696472,-3.95836997032166,0.110414996743202,0.153242006897926,2.12988901138306,-0.622720003128052,-1.99763000011444,0.14376200735569,0.651859998703003,5.19104719161987
97,Ethiopia,1987,49.64547916,46087100.0,132.0,34.5,13,13.8647003173828,4.37149715423584,18.2362003326416,...,-1.71854996681213,-0.485489994287491,-0.029389999806881,0.451462000608444,-1.4795800447464,-1.06228995323181,0.707652986049652,-0.030850000679493,0,1.83635902404785
277,Mauritius,1984,14.406639,1011330.0,26.4,106.6,46.8,30.8346195220947,23.7189407348633,54.5535507202148,...,-60.2523002624512,-8.94552993774414,2.55853796005249,-0.0776799991726875,-6.83859014511108,0.616131007671356,-4.17440986633301,0,0,1.90372800827026
220,Lesotho,1980,23.58414239,1367000.0,108.4000015,102,18,109.822998046875,14.0071697235107,123.830200195312,...,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308
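
A general property of `fillna` to keep in mind for this challenge: when it is given another Series, it copies that Series' value at the same index label into each gap. Filling `popn` from `infmort` would therefore write an infant mortality rate into the population column. The sketch below shows this on made-up numbers, together with one possible alternative: ordering the rows by country and year and interpolating population within each country.

```python
import numpy as np
import pandas as pd

# Made-up numbers for illustration only
panel = pd.DataFrame({
    'countryc': ['X', 'Y', 'X', 'Y', 'X'],
    'year': [1990, 1989, 1988, 1990, 1989],
    'popn': [1300.0, 500.0, 1100.0, 520.0, np.nan],
    'infmort': [55.8, 90.0, 60.0, 88.0, 58.0],
})

# Filling from another column copies that column's value into the gap
copied = panel['popn'].fillna(panel['infmort'])
print(copied.iloc[4])

# Alternative: order by country and year, then interpolate within each country
ordered = panel.sort_values(['countryc', 'year'])
filled = ordered.groupby('countryc')['popn'].transform(lambda s: s.interpolate())
print(filled.loc[4])
```

This is a sketch of one reasonable approach, not the course's solution; whether it is appropriate depends on what the analysis needs.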


In [101]:
df2['countryc'].unique()

array(['Kenya', 'Ghana', 'Liberia', 'Madagascar', 'Ethiopia', 'Lesotho',
       'Mauritius', 'Burkina Faso', 'Cameroon', 'Gambia, The', 'Botswana'],
      dtype=object)

In [111]:
df2['infmort'] = pd.to_numeric(df2["infmort"], downcast="float")
df2['infmort'].dtype # used to be of object type, now is of type float32



dtype('float32')
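
A column usually ends up with `object` dtype when at least one entry cannot be read as a number. `pd.to_numeric` raises an error on such entries by default; with `errors='coerce'` they become `NaN` instead. A small sketch on a made-up column:

```python
import pandas as pd

# A column read as text because one entry is not a number
mixed = pd.Series(['72.4', '106', 'n/a'])

as_float = pd.to_numeric(mixed, errors='coerce')        # 'n/a' becomes NaN
as_float32 = pd.to_numeric(as_float, downcast='float')  # smallest float type that fits

print(as_float.dtype, as_float32.dtype)
print(int(as_float.isna().sum()))
```

Downcasting to `float32` saves memory but keeps only about seven significant digits, which is why 72.4 shows up as 72.400002 in the filtered output further down.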

In [112]:
df2['infmort']>=50

168     True
109     True
204     True
247     True
99      True
228     True
172     True
97      True
277    False
220     True
188     True
126     True
84      True
11      True
198     True
252     True
185     True
6       True
57      True
276    False
257     True
187     True
219     True
235     True
141     True
83      True
266     True
48      True
17      True
159     True
209     True
174     True
139     True
72      True
199     True
25      True
21      True
195     True
10      True
125     True
231     True
274    False
189     True
140     True
120     True
94      True
42      True
283    False
178     True
270    False
131     True
240     True
256     True
98      True
23      True
191     True
Name: infmort, dtype: bool

In [115]:
dfx = df2[df2['infmort']>=50] # keep only the rows where the condition is True
dfx.head(10)

Unnamed: 0,countryc,year,agrgdp,popn,infmort,schprim,schsec,grtdsbp,grlndsbp,aiddsbp,...,dcurexpp,dcapexpp,dprirepp,dcnlnagp,dcnlnenp,dcnlninp,dcnlntacp,dcnlnedup,dcnlnhthp,dcnlnothp
168,Kenya,1980,32.59223808,16560000.0,72.400002,115,20,26.8336296081543,17.7571392059326,44.5907592773438,...,-1.90575003623962,4.4354100227356,-0.64258998632431,-0.43707999587059,-2.52485990524292,-0.424710005521774,-3.0382399559021,-0.0560100004076958,0,7.8144268989563
109,Ghana,1973,48.97463727,9388140.0,106.0,68.2,27.8,7.54381990432739,7.85923004150391,15.4030504226685,...,-13.4622001647949,-5.45116996765137,-0.186719998717308,-0.00694000022485852,-3.54000997543335,0.0770640000700951,-0.0829199999570846,0,0,-0.743390023708344
204,Liberia,1990,1.79769313486232e+308,2435000.0,176.800003,1.79769313486232e+308,1.79769313486232e+308,38.6235008239746,16.2881603240967,54.9116592407227,...,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308
247,Madagascar,1981,33.07584521,8951460.0,134.0,1.79769313486232e+308,1.79769313486232e+308,17.8668098449707,25.6996097564697,43.5664291381836,...,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308
99,Ethiopia,1989,48.50468489,49337300.0,126.800003,34,14,15.0112895965576,3.97879600524902,18.9900798797607,...,0.919135987758636,2.5445671081543,0.135925993323326,-0.0909200012683868,-0.922819972038269,0.01554000005126,0.00121000001672655,-0.114739999175072,0.0564300008118153,-3.32821989059448
228,Lesotho,1988,24.40058125,1689570.0,88.199997,107,24.2,67.3578262329102,17.261739730835,84.6195831298828,...,8.38925743103027,-1.36620998382568,-0.754760026931763,-0.180490002036095,2.19999504089355,-0.662069976329804,-0.273030012845993,-1.6026200056076,0.604721009731293,0.780026018619537
172,Kenya,1984,33.91489131,19302100.0,64.800003,102.2,20.8,20.7450504302978,10.9030799865723,31.6481304168701,...,-2.49663996696472,-3.95836997032166,0.110414996743202,0.153242006897926,2.12988901138306,-0.622720003128052,-1.99763000011444,0.14376200735569,0.651859998703003,5.19104719161987
97,Ethiopia,1987,49.64547916,46087100.0,132.0,34.5,13,13.8647003173828,4.37149715423584,18.2362003326416,...,-1.71854996681213,-0.485489994287491,-0.029389999806881,0.451462000608444,-1.4795800447464,-1.06228995323181,0.707652986049652,-0.030850000679493,0,1.83635902404785
220,Lesotho,1980,23.58414239,1367000.0,108.400002,102,18,109.822998046875,14.0071697235107,123.830200195312,...,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308,1.79769313486232e+308
188,Liberia,1974,32.28125548,1560820.0,175.399994,60.8,15.6,20.0104808807373,12.8318300247192,32.8423118591309,...,1.79769313486232e+308,1.79769313486232e+308,0.397731989622116,0.510950982570648,0.276378005743027,0.448664993047714,4.81496095657349,0.49244299530983,-0.23735000193119,-0.220809996128082


In [118]:
dfxpopn = dfx[['popn']]
dfxpopn.head(7)

| | popn |
|---|---|
| 168 | 16560000.0 |
| 109 | 9388140.0 |
| 204 | 2435000.0 |
| 247 | 8951460.0 |
| 99 | 49337300.0 |
| 228 | 1689570.0 |
| 172 | 19302100.0 |
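
The row filter and the column selection can also be done in one step with `.loc`. Adding `.copy()` makes the result independent of the original frame, which avoids chained-assignment warnings if the subset is modified later. A sketch on made-up numbers:

```python
import pandas as pd

# Made-up numbers for illustration only
health = pd.DataFrame({
    'countryc': ['X', 'Y', 'Z'],
    'infmort': [72.4, 26.4, 106.0],
    'popn': [16560000.0, 1011330.0, 9388140.0],
})

high = health['infmort'] >= 50  # boolean mask, one entry per row

# Row filter and column selection in a single .loc call
high_popn = health.loc[high, ['popn']].copy()
high_popn['popn_millions'] = high_popn['popn'] / 1e6

print(high_popn)
print(int(high.sum()), 'of', len(health), 'rows match')
```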
