## DATA CLEANSING AND MISSING DATA (lesson 2)

1. Rename columns: use the method from the example above to properly name any columns that seem mislabeled in the `population` dataset. The `population` dataset was given in the EDA lesson warmer

2. Missing data: first check and see which and how much data is missing in the `population` dataset

3. Remove missing data: drop all observations with missing data

4. Filter for relevant data: filter the dataset so that it begins with the year 1950

5. Make data persistent: save the dataset as a `.csv` file in your `data` folder, as it will be used for the week’s project

6. Repeat for the **life_expectancy** and **fertility_rate** datasets, which are available below

**Hint:** one of the files is not a `.csv` and must be read in using a pandas function other than `read_csv()`

Related files
fertility_rate.csv (1 MB)
life_expectancy.xls (3 MB)

In [1]:
import pandas as pd

In [2]:
#Rename columns: use the method from the example above to properly name any columns that seem mislabeled 
#... in the population dataset. The population dataset was given in the EDA lesson warmer.

df_pop = pd.read_csv('../data/population.csv')
df_pop.head()

Unnamed: 0,Total population,year,population
0,Abkhazia,1800,
1,Afghanistan,1800,3280000.0
2,Akrotiri and Dhekelia,1800,
3,Albania,1800,410445.0
4,Algeria,1800,2503218.0


In [3]:
df_pop.columns

Index(['Total population', 'year', 'population'], dtype='object')

In [4]:
df_pop.rename(columns={'Total population':'country'},inplace=True)
df_pop.columns

Index(['country', 'year', 'population'], dtype='object')
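
`inplace=True` changes `df_pop` directly. As a minimal sketch of the alternative, `rename()` can also return a new frame and leave the original untouched. The frame below is a tiny stand-in built from two rows of the output above; `sample` and `renamed` are illustrative names, not part of the exercise.

```python
import pandas as pd

# Small stand-in for the population data; values copied from the output above
sample = pd.DataFrame({
    "Total population": ["Afghanistan", "Albania"],
    "year": [1800, 1800],
    "population": [3280000.0, 410445.0],
})

# Without inplace=True, rename() returns a new DataFrame and leaves `sample` as it was
renamed = sample.rename(columns={"Total population": "country"})

print(list(renamed.columns))
print(list(sample.columns))
```

Assigning the result to a new name keeps the raw frame available for comparison, which can help when a rename turns out to be wrong.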

In [5]:
# Missing data: first check and see which and how much data is missing in the population dataset

#check which column(s) contain NaN values
df_pop.isnull().sum()

country          0
year             0
population    2099
dtype: int64
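
Absolute counts are easier to judge next to the share of rows they represent. A minimal sketch, assuming a small stand-in frame (rows taken from the `head()` output above; `sample`, `missing_count` and `missing_share` are illustrative names): `isnull().mean()` gives the fraction of missing values per column.

```python
import pandas as pd

# Stand-in frame: four rows from the head() output above, two of them without a population
sample = pd.DataFrame({
    "country": ["Abkhazia", "Afghanistan", "Akrotiri and Dhekelia", "Albania"],
    "year": [1800, 1800, 1800, 1800],
    "population": [float("nan"), 3280000.0, float("nan"), 410445.0],
})

missing_count = sample.isnull().sum()   # number of NaN values per column
missing_share = sample.isnull().mean()  # fraction of NaN values per column (True counts as 1)

print(missing_count)
print(missing_share)
```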

In [6]:
#view as a DF

df_pop[df_pop['population'].isnull()]

Unnamed: 0,country,year,population
0,Abkhazia,1800,
2,Akrotiri and Dhekelia,1800,
45,Christmas Island,1800,
46,Cocos Island,1800,
58,Czechoslovakia,1800,
...,...,...,...
22270,Northern Marianas,2015,
22271,South Georgia and the South Sandwich Islands,2015,
22272,US Minor Outlying Islands,2015,
22273,Virgin Islands,2015,


In [7]:
#Remove missing data: drop all observations with missing data

df_pop = df_pop.dropna(axis=0)
df_pop

Unnamed: 0,country,year,population
1,Afghanistan,1800,3280000.0
3,Albania,1800,410445.0
4,Algeria,1800,2503218.0
5,American Samoa,1800,8170.0
6,Andorra,1800,2654.0
...,...,...,...
22256,Zambia,2015,16211767.0
22257,Zimbabwe,2015,15602751.0
22259,South Sudan,2015,12339812.0
22260,Curaçao,2015,157203.0
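
After `dropna`, the index keeps its original labels, which is why the output above jumps from 1 to 3. The sketch below, on a small stand-in frame (illustrative names, rows taken from the earlier output), shows two optional refinements: `subset=` restricts the check to chosen columns, and `reset_index(drop=True)` renumbers the rows.

```python
import pandas as pd

# Stand-in frame with two incomplete rows, taken from the head() output above
sample = pd.DataFrame({
    "country": ["Abkhazia", "Afghanistan", "Akrotiri and Dhekelia", "Albania"],
    "year": [1800, 1800, 1800, 1800],
    "population": [float("nan"), 3280000.0, float("nan"), 410445.0],
})

# Drop rows where `population` is missing, then renumber the remaining rows from 0
cleaned = sample.dropna(subset=["population"]).reset_index(drop=True)

print(cleaned)
```

Here only `population` contains missing values, so `subset=["population"]` and a plain `dropna(axis=0)` remove the same rows.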


In [8]:
#Filter for relevant data: filter the dataset so that it begins with the year 1950

df_pop_aft1950 = df_pop[df_pop['year'] >= 1950]
df_pop_aft1950.shape

(16741, 3)
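
Note that `>= 1950` keeps the year 1950 itself, even though the variable name says "after". A minimal sketch on made-up rows (the population values are placeholders, not real figures) checks the boundary and takes a `.copy()`, so that later edits to the filtered frame do not trigger pandas' chained-assignment warning.

```python
import pandas as pd

# Placeholder data around the cut-off year; the population values are not real figures
sample = pd.DataFrame({
    "country": ["Albania", "Albania", "Albania"],
    "year": [1949, 1950, 1951],
    "population": [1.0, 2.0, 3.0],
})

# Boolean mask keeps rows from 1950 onward; copy() makes the result an independent frame
recent = sample[sample["year"] >= 1950].copy()

print(recent["year"].min())
print(recent.shape)
```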

In [None]:
# Make data persistent: save the dataset as a .csv file in your data folder, as 
#... it will be used for the week’s project

df_pop_aft1950.to_csv('../data/population_after_1950.csv')
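
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below compares both variants using in-memory buffers instead of real files; the frame, its values and all names are illustrative.

```python
import io
import pandas as pd

# Placeholder frame with a non-contiguous index, as left behind by dropna()
sample = pd.DataFrame(
    {"country": ["Afghanistan", "Albania"], "year": [1950, 1950], "population": [1.0, 2.0]},
    index=[3, 7],
)

# Default: the index is written as a first column with an empty header
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

Whether to keep the index is a judgment call; for these datasets the index carries no meaning, so `index=False` would likely give a cleaner file.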

In [None]:
#Repeat for the life_expectancy and fertility_rate datasets, which are available below
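
Since the same rename, drop and filter steps recur for every dataset, one option is to wrap them in a small helper. This is only a sketch: `clean_dataset` and its parameters are hypothetical names, and it assumes that the first column holds the country and that a `year` column exists, as in the three files used here.

```python
import pandas as pd

def clean_dataset(df, first_column_name="country", start_year=1950):
    """Rename the first column, drop rows with missing values, keep start_year onward."""
    df = df.rename(columns={df.columns[0]: first_column_name})
    df = df.dropna(axis=0)
    return df[df["year"] >= start_year]

# Stand-in frame built from rows shown in the fertility output of this notebook
sample = pd.DataFrame({
    "Total fertility rate": ["Abkhazia", "Afghanistan", "Zambia", "Yugoslavia"],
    "year": [1800, 1800, 2015, 2015],
    "fertility": [float("nan"), 7.00, 5.59, float("nan")],
})

cleaned = clean_dataset(sample)
print(cleaned)
```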

In [4]:
#Rename columns: use the method from the example above to properly name any columns that seem mislabeled 
#... in the fertility_rate dataset.

df_fert = pd.read_csv('../data/fertility_rate.csv')
df_fert  # show the frame itself; `df_fert.describe` without parentheses only prints the bound method

Unnamed: 0,Total fertility rate,year,fertility
0,Abkhazia,1800,
1,Afghanistan,1800,7.00
2,Akrotiri and Dhekelia,1800,
3,Albania,1800,4.60
4,Algeria,1800,6.99
...,...,...,...
56154,Yemen,2015,3.83
56155,Yugoslavia,2015,
56156,Zambia,2015,5.59
56157,Zimbabwe,2015,3.35
56158,Åland,2015,

[56159 rows x 3 columns]

In [10]:
df_fert.columns

Index(['Total fertility rate', 'year', 'fertility'], dtype='object')

In [11]:
df_fert.rename(columns={'Total fertility rate':'country'},inplace=True)
df_fert.columns

Index(['country', 'year', 'fertility'], dtype='object')

In [12]:
# Missing data: first check and see which and how much data is missing in the fertility_rate dataset

#check which column(s) contain NaN values
df_fert.isnull().sum()

country          0
year             0
fertility    12747
dtype: int64

In [14]:
df_fert[df_fert['fertility'].isnull()]

Unnamed: 0,country,year,fertility
0,Abkhazia,1800,
2,Akrotiri and Dhekelia,1800,
5,American Samoa,1800,
6,Andorra,1800,
8,Anguilla,1800,
...,...,...,...
56148,West Germany,2015,
56152,North Yemen (former),2015,
56153,South Yemen (former),2015,
56155,Yugoslavia,2015,


In [15]:
#Remove missing data: drop all observations with missing data

df_fert = df_fert.dropna(axis=0)
df_fert.shape

(43412, 3)

In [16]:
#Filter for relevant data: filter the dataset so that it begins with the year 1950

df_fert_aft1950 = df_fert[df_fert['year'] >= 1950]
df_fert_aft1950.shape


(13262, 3)

In [17]:
# Make data persistent: save the dataset as a .csv file in your data folder, as 
#... it will be used for the week’s project

df_fert_aft1950.to_csv('../data/fertility_after_1950.csv')

In [18]:
#Rename columns: use the method from the example above to properly name any columns that seem mislabeled 
#... in the life_expectancy dataset.

df_expec = pd.read_excel('../data/life_expectancy.xls')
df_expec.head()

Unnamed: 0,Life expectancy,year,life expectancy
0,Abkhazia,1800,
1,Afghanistan,1800,28.21
2,Akrotiri and Dhekelia,1800,
3,Albania,1800,35.4
4,Algeria,1800,28.82
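
When several files with different formats need loading, the reader can be chosen from the file extension. The sketch below is one possible approach, not part of the exercise: `READERS` and `pick_reader` are hypothetical names, and only the extensions listed are handled. Note that `pd.read_excel` needs an Excel engine installed (for `.xls` files this is typically `xlrd`).

```python
from pathlib import Path
import pandas as pd

# Hypothetical mapping from file extension to pandas reader function
READERS = {".csv": pd.read_csv, ".xls": pd.read_excel, ".xlsx": pd.read_excel}

def pick_reader(path):
    """Return the pandas reader registered for the file's extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"no reader registered for {suffix!r}")
    return READERS[suffix]

# Only the reader is selected here; no file is opened
reader = pick_reader("../data/life_expectancy.xls")
print(reader.__name__)
```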


In [19]:
df_expec.columns

Index(['Life expectancy', 'year', 'life expectancy'], dtype='object')

In [20]:
df_expec.rename(columns={'Life expectancy':'country'},inplace=True)
df_expec.columns

Index(['country', 'year', 'life expectancy'], dtype='object')

In [21]:
# Missing data: first check and see which and how much data is missing in the life_expectancy dataset

#check which column(s) contain NaN values
df_expec.isnull().sum()

country                0
year                   0
life expectancy    12563
dtype: int64

In [22]:
df_expec[df_expec['life expectancy'].isnull()]

Unnamed: 0,country,year,life expectancy
0,Abkhazia,1800,
2,Akrotiri and Dhekelia,1800,
5,American Samoa,1800,
6,Andorra,1800,
8,Anguilla,1800,
...,...,...,...
56408,West Germany,2016,
56412,North Yemen (former),2016,
56413,South Yemen (former),2016,
56415,Yugoslavia,2016,


In [23]:
#Remove missing data: drop all observations with missing data

df_expec = df_expec.dropna(axis=0)
df_expec

Unnamed: 0,country,year,life expectancy
1,Afghanistan,1800,28.21
3,Albania,1800,35.40
4,Algeria,1800,28.82
7,Angola,1800,26.98
9,Antigua and Barbuda,1800,33.54
...,...,...,...
56411,Virgin Islands (U.S.),2016,80.82
56414,Yemen,2016,64.92
56416,Zambia,2016,57.10
56417,Zimbabwe,2016,61.69


In [25]:
#Filter for relevant data: filter the dataset so that it begins with the year 1950

df_expec_aft1950 = df_expec[df_expec['year'] >= 1950]
df_expec_aft1950.shape

(13707, 3)

In [26]:
# Make data persistent: save the dataset as a .csv file in your data folder, as 
#... it will be used for the week’s project

df_expec_aft1950.to_csv('../data/expectancy_after_1950.csv')

# Bonus

This section is an excerpt of Wickham, H. (2014). Tidy data. The Journal of Statistical Software, 59, http://www.jstatsoft.org/v59/i10/

It is often said that 80% of data analysis is spent on cleaning and preparing data. […] The principles of tidy data provide a standard way to organise data values within a dataset. A standard makes initial data cleaning easier because you don’t need to start from scratch and reinvent the wheel every time. […]

>Happy families are all alike; every unhappy family is unhappy in its own way 
— Leo Tolstoy

Like families, tidy datasets are all alike but every messy dataset is messy in its own way. Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning). […] A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

+ Every column is a variable.
+ Every row is an observation.
+ Every cell is a single value.
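
The three datasets in this lesson already follow these rules, with one row per country and year. As a contrast, the sketch below starts from a hypothetical wide layout with one column per year and reshapes it into tidy form with `melt`. The frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical wide layout: the year is spread across column headers (placeholder values)
wide = pd.DataFrame({
    "country": ["Afghanistan", "Albania"],
    "1800": [1.0, 2.0],
    "1801": [3.0, 4.0],
})

# melt() turns the year columns into rows: one observation per country and year
tidy = wide.melt(id_vars="country", var_name="year", value_name="population")
tidy["year"] = tidy["year"].astype(int)  # column headers were strings

print(tidy)
```

In the result, `country`, `year` and `population` are each a variable in their own column, and each row is a single observation.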