# Case study


## Putting it all together

- Use the techniques you’ve learned on Gapminder data 
- Clean and tidy data saved to a file
    - Ready to be loaded for analysis!
- Dataset consists of life expectancy by country and year
- Data will come in multiple parts
    - Load
    - Preliminary quality diagnosis
    - Combine into single dataset
- Useful methods
        In [1]: import pandas as pd
        In [2]: df = pd.read_csv('my_data.csv')
        In [3]: df.head()
        In [4]: df.info()
        In [5]: df.columns
        In [6]: df.describe()
        In [7]: df.column.value_counts()
        In [8]: df.column.plot(kind='hist')
- Data quality
        In [9]: def cleaning_function(row_data):
           ...:     # data cleaning steps
           ...:     return ...
        In [10]: df.apply(cleaning_function, axis=1)
        In [11]: assert (df.column_data > 0).all()
- Combining data
        pd.merge(df1, df2, ...) 
        pd.concat([df1, df2, df3, ...])
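
The steps listed above (load, diagnose, check, combine) can be sketched end to end on toy data. This is a minimal sketch, not the course's own code: the frames `part1` and `part2` and the helper `non_negative_or_missing` are hypothetical stand-ins for the real files and cleaning function.

```python
import pandas as pd

# Hypothetical toy data standing in for two parts of a larger dataset
part1 = pd.DataFrame({'country': ['A', 'B'], 'value': [1.5, 2.0]})
part2 = pd.DataFrame({'country': ['C', 'D'], 'value': [3.0, None]})

def non_negative_or_missing(row_data):
    """Return True if the row's value is missing or >= 0."""
    value = row_data['value']
    return pd.isna(value) or value >= 0

# Stack the parts row-wise and rebuild a clean 0..n-1 index
combined = pd.concat([part1, part2], ignore_index=True)

# Row-wise quality check, followed by a hard assertion on the result
row_ok = combined.apply(non_negative_or_missing, axis=1)
assert row_ok.all()
```

Passing `ignore_index=True` avoids carrying repeated index labels from the individual parts into the combined frame.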

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import jupyterthemes.jtplot as jtplot
%matplotlib inline
jtplot.style(theme='onedork')

In [8]:
gapminder = pd.read_csv('exercise/gapminder.csv', index_col=0)
print(gapminder.head(3))
print(gapminder.info())
#print(gapminder.columns)
#print(gapminder.describe())
print(gapminder.shape)

    1800  1801   1802   1803   1804   1805   1806   1807   1808   1809  ...  \
0    NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN  ...   
1  28.21  28.2  28.19  28.18  28.17  28.16  28.15  28.14  28.13  28.12  ...   
2    NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN  ...   

   2008  2009  2010  2011  2012  2013  2014  2015  2016        Life expectancy  
0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN               Abkhazia  
1   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN            Afghanistan  
2   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  Akrotiri and Dhekelia  

[3 rows x 218 columns]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 780 entries, 0 to 259
Columns: 218 entries, 1800 to Life expectancy
dtypes: float64(217), object(1)
memory usage: 1.3+ MB
None
(780, 218)


In [3]:
# Preprocessing for this exercise
# Move the 'Life expectancy' column to position 0
gapminder = pd.read_csv('exercise/gapminder.csv', index_col=0)
life_exp_loc = gapminder.columns.get_loc('Life expectancy')
if life_exp_loc != 0:
    gapminder = gapminder.rename(columns={'Life expectancy' : 'Old_Life_expectancy'})
    gapminder.insert(loc=0, column='Life expectancy', value=gapminder.iloc[:,life_exp_loc])
    gapminder = gapminder.drop(columns = 'Old_Life_expectancy')
# Split the columns by century
g1800s, g1900s, g2000s = gapminder[['Life expectancy']], gapminder[['Life expectancy']], gapminder[['Life expectancy']]
for year in list(gapminder.columns):
    if year[0:2] == '18':
        g1800s = pd.concat([g1800s, gapminder[year]], axis=1)
    elif year[0:2] == '19':
        g1900s = pd.concat([g1900s, gapminder[year]], axis=1)
    elif year[0:2] == '20':
        g2000s = pd.concat([g2000s, gapminder[year]], axis=1)
#print(g1800s.info())
#print(g1900s.info())
#print(g2000s.info())

In [4]:
def check_null_or_valid(row_data):
    """Function that takes a row of data,
    drops all missing values,
    and returns a Boolean Series that is True where a remaining value is >= 0
    """
    no_na = row_data.dropna()
    numeric = pd.to_numeric(no_na)
    ge0 = numeric >= 0
    return ge0

# Check whether the first column is 'Life expectancy'
assert gapminder.columns[0] == 'Life expectancy'

# Check whether the values in the row are valid
assert gapminder.iloc[:, 1:].apply(check_null_or_valid, axis=1).all().all()

# Check that there is only one instance of each country
#assert gapminder['Life expectancy'].value_counts()[0] == 1
print(g1800s['Life expectancy'].value_counts())
#g1800s = g1800s.drop_duplicates()
#g1800s['Life expectancy'].value_counts()

Papua New Guinea        3
Bangladesh              3
Christmas Island        3
Malawi                  3
USSR                    3
                       ..
United Arab Emirates    3
Paraguay                3
St. Helena              3
India                   3
Kosovo                  3
Name: Life expectancy, Length: 260, dtype: int64
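
The counts above show every country appearing three times, so the uniqueness check would fail until duplicates are removed. A minimal sketch of that step on toy data follows; the frame `df` and its values are hypothetical, and it assumes the repeated rows are exact copies of each other.

```python
import pandas as pd

# Hypothetical frame in which every row appears three times
df = pd.DataFrame({
    'country': ['A', 'B', 'A', 'B', 'A', 'B'],
    '1800': [28.2, 35.4, 28.2, 35.4, 28.2, 35.4],
})

counts_before = df['country'].value_counts()

# Drop fully identical rows and rebuild the index
deduped = df.drop_duplicates().reset_index(drop=True)
counts_after = deduped['country'].value_counts()

# Each country should now appear exactly once
assert (counts_after == 1).all()
```

If the repeated rows differ in any column, `drop_duplicates()` keeps them all, and a `subset=` argument or a manual review would be needed instead.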


In [5]:
# Concatenate the DataFrames column-wise
gapminder = pd.concat([g1800s, g1900s, g2000s], axis=1)
# Print a summary of gapminder
print(gapminder.info())
# The number of object-dtype columns went from 1 to 3:
# pd.concat does not merge the repeated 'Life expectancy' columns

gapminder = pd.concat([g1800s, g1900s.drop(columns='Life expectancy'), 
                       g2000s.drop(columns='Life expectancy')], axis =1)
print(gapminder.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 780 entries, 0 to 259
Columns: 220 entries, Life expectancy to 2016
dtypes: float64(217), object(3)
memory usage: 1.3+ MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 780 entries, 0 to 259
Columns: 218 entries, Life expectancy to 2016
dtypes: float64(217), object(1)
memory usage: 1.3+ MB
None
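
An alternative to dropping the repeated key column by hand is to move it into the index before concatenating. The sketch below uses hypothetical toy frames `g1` and `g2` with one row per country; it would not apply unchanged to data where country names are duplicated.

```python
import pandas as pd

# Hypothetical pieces that share a 'country' key column
g1 = pd.DataFrame({'country': ['A', 'B'], '1800': [28.2, 35.4]})
g2 = pd.DataFrame({'country': ['A', 'B'], '1900': [30.0, 40.0]})

# Naive column-wise concat keeps both copies of 'country'
naive = pd.concat([g1, g2], axis=1)

# With the key as the index, concat aligns on it and keeps one copy
keyed = pd.concat(
    [g1.set_index('country'), g2.set_index('country')],
    axis=1,
).reset_index()
```

Aligning on the index also protects against pieces whose rows are in a different order, which positional concatenation would silently mismatch.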


## Initial impressions of the data

- Principles of tidy data
    - Rows form observations
    - Columns form variables
    - Tidying data will make data cleaning easier 
    - Melting turns columns into rows
    - Pivot will take unique values from a column and create new columns
- Checking data types
        In [1]: df.dtypes
        In [2]: df['column'] = pd.to_numeric(df['column'])
        In [3]: df['column'] = df['column'].astype(str)
- Additional calculations and saving your data
        In [4]: df['new_column'] = df['column_1'] + df['column_2']
        In [5]: df['new_column'] = df.apply(my_function, axis=1)
        In [6]: df.to_csv('my_data.csv')
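
To illustrate the melt and pivot bullets above, here is a small round trip on toy data. The frame `wide` and its values are made up for illustration; only `pd.melt`, `pd.to_numeric`, and `DataFrame.pivot` are real pandas calls.

```python
import pandas as pd

# Hypothetical wide table: one column per year
wide = pd.DataFrame({
    'country': ['A', 'B'],
    '1800': [28.2, 35.4],
    '1801': [28.1, 35.4],
})

# Melt: year columns become rows (one row per country and year)
long = pd.melt(wide, id_vars='country',
               var_name='year', value_name='life_expectancy')
long['year'] = pd.to_numeric(long['year'])

# Pivot: unique values of 'year' become columns again
wide_again = long.pivot(index='country', columns='year',
                        values='life_expectancy')
```

After melting, the year labels are strings, which is why the sketch converts them with `pd.to_numeric` before any numeric work.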

In [6]:
# Tidying data
# Melt gapminder: gapminder_melt
gapminder_melt = pd.melt(frame=gapminder, id_vars= 'Life expectancy')

# Rename the columns
gapminder_melt.columns = ['country', 'year', 'life_expectancy']

# Print the head of gapminder_melt
print(gapminder_melt.head())

                 country  year  life_expectancy
0               Abkhazia  1800              NaN
1            Afghanistan  1800            28.21
2  Akrotiri and Dhekelia  1800              NaN
3                Albania  1800            35.40
4                Algeria  1800            28.82


In [9]:
#Checking the data types
print(gapminder_melt.dtypes)

# Convert the year column to numeric
gapminder_melt.year = pd.to_numeric(gapminder_melt.year, errors='coerce')
print(gapminder_melt.dtypes)
# Test if country is of type object
# (np.object was removed in NumPy 1.24; the builtin object works everywhere)
assert gapminder_melt.country.dtype == object
# Test if year is of type int64
assert gapminder_melt.year.dtype == np.int64
# Test if life_expectancy is of type float64
assert gapminder_melt.life_expectancy.dtype == np.float64


country             object
year                 int64
life_expectancy    float64
dtype: object
country             object
year                 int64
life_expectancy    float64
dtype: object


In [14]:
# Create the series of countries: countries
countries = gapminder_melt.country

# Drop all the duplicates from countries
countries = countries.drop_duplicates()

# Write the regular expression: pattern
# (raw string so that \. and \s reach the regex engine unchanged)
pattern = r'^[A-Za-z\.\s]*$'

# Create the Boolean vector: mask
# True where a name consists only of letters, periods, and whitespace
mask = countries.str.contains(pattern)

# Invert the mask: mask_inverse
mask_inverse = ~mask

# Subset countries using mask_inverse: invalid_countries
invalid_countries = countries.loc[mask_inverse]

# Print invalid_countries
print(invalid_countries)


49            Congo, Dem. Rep.
50                 Congo, Rep.
53               Cote d'Ivoire
73      Falkland Is (Malvinas)
93               Guinea-Bissau
98            Hong Kong, China
118    United Korea (former)\n
131               Macao, China
132             Macedonia, FYR
145      Micronesia, Fed. Sts.
161            Ngorno-Karabakh
187             St. Barthélemy
193     St.-Pierre-et-Miquelon
225                Timor-Leste
251      Virgin Islands (U.S.)
252       North Yemen (former)
253       South Yemen (former)
258                      Åland
Name: country, dtype: object
0       True
1       True
2       True
3       True
4       True
       ...  
255     True
256     True
257     True
258    False
259     True
Name: country, Length: 260, dtype: bool
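
The masking step can be checked on a handful of names. This sketch reuses the pattern from the cell above on a small hypothetical Series, with names chosen to include a comma, a hyphen, and a non-ASCII letter.

```python
import pandas as pd

# Hypothetical sample of country names
countries = pd.Series(
    ['Albania', 'St. Helena', 'Congo, Rep.', 'Guinea-Bissau', 'Åland']
)

# Letters, periods, and whitespace only, anchored at both ends
pattern = r'^[A-Za-z\.\s]*$'

# True where the whole name fits the pattern
mask = countries.str.contains(pattern)

# Names with any other character (comma, hyphen, accented letter)
invalid = countries[~mask]
```

Names flagged this way are not necessarily wrong: `Åland` and `Guinea-Bissau` are legitimate, so the mask is better read as a list of entries to review than as a list of errors.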


### [Regular Expression](https://zh.wikipedia.org/wiki/正则表达式)

## Merge data