# Putting it all together

### Putting it all together
- Use the techniques you’ve learned on Gapminder data
- Clean and tidy data saved to a file
    - Ready to be loaded for analysis!
- Dataset consists of life expectancy by country and year
- Data will come in multiple parts
    - Load
    - Preliminary quality diagnosis
    - Combine into single dataset

### Useful methods
```python
In [1]: import pandas as pd
In [2]: df = pd.read_csv('my_data.csv')
In [3]: df.head()
In [4]: df.info()
In [5]: df.columns
In [6]: df.describe()
In [7]: df.column.value_counts()
In [8]: df.column.plot(kind='hist')
```

### Data quality
```python
In [9]: def cleaning_function(row_data):
   ...:     # data cleaning steps
   ...:     return ...
In [10]: df.apply(cleaning_function, axis=1)  # axis=1 applies the function row-wise
In [11]: assert (df.column_data > 0).all()
```
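
As a minimal sketch of the pattern above, the following applies a row-wise cleaning function and then asserts a basic expectation about the data. The DataFrame, its column names, and `clean_name` are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy data, not part of the Gapminder dataset
df = pd.DataFrame({'name': [' Ann ', 'bob'], 'age': [31, 25]})

def clean_name(row_data):
    """Strip surrounding whitespace and title-case the name in one row."""
    return row_data['name'].strip().title()

# axis=1 passes each row to the function as a Series
df['name'] = df.apply(clean_name, axis=1)

# The assert raises AssertionError if any age is not positive
assert (df['age'] > 0).all()
print(df)
```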

### Combining data

- pd.merge(df1, df2, …)
- pd.concat([df1, df2, df3, …])
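
The difference between the two functions can be sketched on toy data. The frames `left` and `right` below are hypothetical stand-ins, not the actual Gapminder files.

```python
import pandas as pd

# Hypothetical toy frames that share a 'country' column
left = pd.DataFrame({'country': ['A', 'B'], '1800': [30.0, 35.0]})
right = pd.DataFrame({'country': ['A', 'B'], '1900': [40.0, 45.0]})

# concat stacks the frames row-wise; columns missing from one frame become NaN
stacked = pd.concat([left, right], ignore_index=True)

# merge aligns rows on the shared key column instead of stacking them
joined = pd.merge(left, right, on='country')

print(stacked.shape)  # (4, 3)
print(joined.shape)   # (2, 3)
```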

---
# Let’s practice!

```python
# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Create the scatter plot
g1800s.plot(kind='scatter', x='1800', y='1899')

# Specify axis labels
plt.xlabel('Life Expectancy by Country in 1800')
plt.ylabel('Life Expectancy by Country in 1899')

# Specify axis limits
plt.xlim(20, 55)
plt.ylim(20, 55)

# Display the plot
plt.show()
```

---

```python
def check_null_or_valid(row_data):
    """Function that takes a row of data,
    drops all missing values,
    and checks if all remaining values are greater than or equal to 0
    """
    no_na = row_data.dropna()[1:-1]
    numeric = pd.to_numeric(no_na)
    ge0 = numeric >= 0
    return ge0

# Check whether the first column is 'Life expectancy'
assert g1800s.columns[0] == 'Life expectancy'

# Check whether the values in the row are valid
assert g1800s.iloc[:, 1:].apply(check_null_or_valid, axis=1).all().all()

# Check that there is only one instance of each country
assert g1800s['Life expectancy'].value_counts().iloc[0] == 1
```

---
```python
# Concatenate the DataFrames row-wise
gapminder = pd.concat([g1800s, g1900s, g2000s])

# Print the shape of gapminder
print(gapminder.shape)

# Print the head of gapminder
print(gapminder.head())
```

# Initial impressions of the data


### Principles of tidy data
- Rows form observations
- Columns form variables
- Tidying data will make data cleaning easier
- Melting turns columns into rows
- Pivot will take unique values from a column and create new columns
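
A small sketch of melting and pivoting, assuming a hypothetical wide table with one column per year (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year
wide = pd.DataFrame({
    'country': ['A', 'B'],
    '1800': [30.0, 35.0],
    '1801': [31.0, 36.0],
})

# Melt: the year columns become rows, giving one observation per row
long = pd.melt(wide, id_vars='country',
               var_name='year', value_name='life_expectancy')

# Pivot: the unique values in 'year' become columns again
back = long.pivot(index='country', columns='year', values='life_expectancy')

print(long)
print(back)
```

The melted frame has four rows (two countries times two years), which is the tidy shape used in the rest of this section.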

### Checking data types

```python
In [1]: df.dtypes
In [2]: df['column'] = pd.to_numeric(df['column'])
In [3]: df['column'] = df['column'].astype(str)
```

### Additional calculations and saving your data
```python
In [4]: df['new_column'] = df['column_1'] + df['column_2']
In [5]: df['new_column'] = df.apply(my_function, axis=1)
In [6]: df.to_csv('my_data.csv')
```

---

```python
# Melt gapminder: gapminder_melt
gapminder_melt = pd.melt(frame=gapminder, id_vars='Life expectancy')

# Rename the columns
gapminder_melt.columns = ['country', 'year', 'life_expectancy']

# Print the head of gapminder_melt
print(gapminder_melt.head())
```

---
```python
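import numpy as np

# Convert the year column to numeric
# (assumes gapminder now refers to the melted DataFrame, gapminder_melt above)
gapminder.year = pd.to_numeric(gapminder['year'], errors='coerce')

# Test if country is of type object
# (np.object was removed in NumPy 1.24; the builtin object is equivalent)
assert gapminder.country.dtypes == object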

# Test if year is of type int64
assert gapminder.year.dtypes == np.int64

# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64


```
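
One caveat with `errors='coerce'` can be shown with a short sketch. The Series `raw` below is hypothetical: if any entry cannot be parsed, it becomes NaN and the result is float64 rather than int64, so an int64 assertion like the one above would fail.

```python
import pandas as pd

# Hypothetical column with one unparseable entry
raw = pd.Series(['1800', '1801', 'unknown'])

# errors='coerce' turns unparseable entries into NaN instead of raising
years = pd.to_numeric(raw, errors='coerce')

print(years.dtype)         # float64, because NaN forces a float dtype
print(years.isna().sum())  # 1
```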


---
```python
# Create the series of countries: countries
countries = gapminder['country']

# Drop all the duplicates from countries
countries = countries.drop_duplicates()

# Write the regular expression: pattern
pattern = r'^[A-Za-z\.\s]*$'

# Create the Boolean vector: mask
mask = countries.str.contains(pattern)

# Invert the mask: mask_inverse
mask_inverse = ~mask

# Subset countries using mask_inverse: invalid_countries
invalid_countries = countries.loc[mask_inverse]

# Print invalid_countries
print(invalid_countries)

```
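
To see what the pattern accepts and rejects, here is a sketch on a few hand-picked names. The Series `names` is hypothetical and is not the output of the exercise.

```python
import pandas as pd

# Hypothetical sample of country names
names = pd.Series(['Sweden', 'St. Lucia', "Cote d'Ivoire", 'Congo, Rep.'])

# Only letters, periods, and whitespace are allowed from start to end
pattern = r'^[A-Za-z\.\s]*$'

mask = names.str.contains(pattern)

# Names containing any other character (apostrophe, comma, ...) are flagged
invalid = names.loc[~mask]
print(invalid.tolist())  # ["Cote d'Ivoire", 'Congo, Rep.']
```

A flagged name is not necessarily wrong. The pattern only narrows down which entries deserve a manual look.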

---
```python

# Assert that country does not contain any missing values
assert pd.notnull(gapminder.country).all()

# Assert that year does not contain any missing values
assert pd.notnull(gapminder.year).all()

# Drop the missing values
gapminder = gapminder.dropna(axis=0, how='any')

# Print the shape of gapminder
print(gapminder.shape)

```


### Wrapping up
Now that you have a clean and tidy dataset, you can do a bit of visualization and aggregation. In this exercise, you'll begin by creating a histogram of the life_expectancy column. You should not see any values under 0, and the upper end of the range should look plausible for a human lifespan.

Your next task is to investigate how average life expectancy changed over the years. To do this, you need to subset the data by each year, get the life_expectancy column from each subset, and take an average of the values. You can achieve this using the .groupby() method. This .groupby() method is covered in greater depth in Manipulating DataFrames with pandas.
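
The subset-then-average idea can be sketched on a tiny, hypothetical tidy frame (the values are made up):

```python
import pandas as pd

# Hypothetical tidy data: one row per country and year
tidy = pd.DataFrame({
    'country': ['A', 'B', 'A', 'B'],
    'year': [1800, 1800, 1801, 1801],
    'life_expectancy': [30.0, 40.0, 32.0, 42.0],
})

# groupby splits the rows by year, then mean() averages each group
by_year = tidy.groupby('year')['life_expectancy'].mean()

# The same result for one year, computed by subsetting manually
manual_1800 = tidy.loc[tidy['year'] == 1800, 'life_expectancy'].mean()

print(by_year)
print(manual_1800)  # 35.0
```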

Finally, you can save your tidy and summarized DataFrame to a file using the .to_csv() method.

Matplotlib and pandas have been pre-imported as plt and pd. Go for it!

---
- Create a histogram of the `life_expectancy` column using the `.plot()` method of gapminder. Specify `kind='hist'`.
- Group gapminder by 'year' and aggregate 'life_expectancy' by the mean. To do this:
    - Use the `.groupby()` method on gapminder with 'year' as the argument. Then select 'life_expectancy' and chain the `.mean()` method to it.
- Print the head and tail of gapminder_agg. This has been done for you.
- Create a line plot of average life expectancy per year by using the `.plot()` method (without any arguments) on `gapminder_agg`.
- Save gapminder and gapminder_agg to csv files called 'gapminder.csv' and 'gapminder_agg.csv', respectively, using the `.to_csv()` method.

```python
# Add first subplot
plt.subplot(2, 1, 1) 

# Create a histogram of life_expectancy
gapminder.life_expectancy.plot(kind='hist')

# Group gapminder: gapminder_agg
gapminder_agg = gapminder.groupby('year')['life_expectancy'].mean()

# Print the head of gapminder_agg
print(gapminder_agg.head())

# Print the tail of gapminder_agg
print(gapminder_agg.tail())

# Add second subplot
plt.subplot(2, 1, 2)

# Create a line plot of life expectancy per year
gapminder_agg.plot()

# Add title and specify axis labels
plt.title('Life expectancy over the years')
plt.ylabel('Life expectancy')
plt.xlabel('Year')

# Display the plots
plt.tight_layout()
plt.show()

# Save both DataFrames to csv files
gapminder.to_csv('gapminder.csv')
gapminder_agg.to_csv('gapminder_agg.csv')
```

# You’ve learned how to…
- Load and view data in pandas
- Visually inspect data for errors and potential problems
- Tidy data for analysis and reshape it
- Combine datasets
- Clean data by using regular expressions and functions
- Test your data and be proactive in finding potential errors