# In the previous episode...

### we learned to read and write files
```python
with open("old_shiny_file.txt", "r") as fh:
    content = fh.read()
    
with open("new_shiny_file.txt", "w") as fh:
    fh.write("hey, this is cool!!\n" * 20)
```

and implemented an "Ishmael" counter

# .. and now

# Introduction to Data Analysis
We'll download a dataset ([original source](https://www.kaggle.com/unsdsn/world-happiness)) and visualise it.

In [None]:
# Don't worry about the code in this cell for now, we'll get to this stuff in a future lesson
import urllib.request   
import zipfile
import os

urllib.request.urlretrieve("https://raw.githubusercontent.com/gabrielecalvo/Language4Water/master/assets/HappinessReport.zip", 'HappinessReport.zip')
print("downloaded")

with zipfile.ZipFile('HappinessReport.zip', 'r') as zip_ref:
    zip_ref.extractall("HappinessReport")
print("unzipped")
    
os.remove("HappinessReport.zip") 
print("removed zipped file")

## loading data from a single csv

In [None]:
import pandas as pd
from pathlib import Path

data_folder = Path("HappinessReport")
list_of_csvs = sorted(data_folder.glob("*.csv"))  # sorted, because glob does not guarantee any order

list_of_csvs

In [None]:
df = pd.read_csv(list_of_csvs[1]) # taking 2016 as sample
df

In [None]:
# re-loading using the first column as index
df = pd.read_csv(list_of_csvs[1], index_col=0)
df
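
To see what `index_col=0` changes, here is a minimal sketch on a tiny made-up CSV held in memory (the country names and scores are invented, not taken from the dataset):

```python
import io
import pandas as pd

# Made-up three-row CSV, only loosely shaped like the happiness data
csv_text = """Country,Region,Happiness Score
Atlantis,Ocean,7.5
Narnia,Wardrobe,6.9
Mordor,Middle-earth,2.1
"""

default_index = pd.read_csv(io.StringIO(csv_text))               # rows labelled 0, 1, 2
country_index = pd.read_csv(io.StringIO(csv_text), index_col=0)  # rows labelled by country

print(default_index.index.tolist())
print(country_index.index.tolist())
print(country_index.columns.tolist())  # "Country" is now the index, not a regular column
```

With the country as the index, rows can be looked up by name instead of by number.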

### selecting

In [None]:
# selecting by row
df.loc['Switzerland']  # by label
df.iloc[1]             # by position (0-based, so this is the second row)

In [None]:
# select by column(s)
df['Region']              # returns a single column (Series)
df.loc[:, 'Region']       # equivalent longer form

df[['Region']]            # returns a table (DataFrame) with one column 
df[['Region', 'Family']]  # returns a table (DataFrame) with 2 columns

In [None]:
# select by both row and column
df.loc['Switzerland', 'Region'] 
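
The selections above can be tried on a small toy table. This is only a sketch: the `toy` DataFrame and its values are invented for illustration.

```python
import pandas as pd

# Invented data; the index plays the role of the Country column
toy = pd.DataFrame(
    {"Region": ["Ocean", "Wardrobe", "Middle-earth"], "Score": [7.5, 6.9, 2.1]},
    index=["Atlantis", "Narnia", "Mordor"],
)

by_label = toy.loc["Narnia"]               # row whose label is "Narnia"
by_position = toy.iloc[1]                  # second row, whatever its label is
one_column = toy["Score"]                  # a Series
one_column_table = toy[["Score"]]          # a DataFrame with a single column
single_value = toy.loc["Narnia", "Score"]  # one cell

print(type(one_column).__name__, type(one_column_table).__name__)
print(single_value)
```

Here `by_label` and `by_position` are the same row, because "Narnia" happens to sit in the second position.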

### sorting

In [None]:
df.sort_index()                             # sort by index (Country)
df.sort_values('Freedom')                   # sort by specific column
df.sort_values('Freedom', ascending=False)  # sort by specific column but in reverse order
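
One thing worth knowing: by default the sorting methods return a new, sorted table and leave the original untouched. A minimal sketch with invented values:

```python
import pandas as pd

# Invented values, just to make the row order easy to follow
toy = pd.DataFrame({"Freedom": [0.6, 0.3, 0.9]}, index=["Narnia", "Atlantis", "Mordor"])

ascending = toy.sort_values("Freedom")
descending = toy.sort_values("Freedom", ascending=False)
by_country = toy.sort_index()

print(ascending.index.tolist())
print(descending.index.tolist())
print(by_country.index.tolist())
print(toy.index.tolist())  # the original order is unchanged
```

To keep the sorted version, assign the result to a variable, e.g. `df = df.sort_values('Freedom')`.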

In [None]:
# group by
df.groupby("Region").mean()
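
As a rough illustration of what the group-by does, here is a sketch on three invented rows: rows sharing a `Region` are collected together and every remaining column is averaged within each group.

```python
import pandas as pd

# Invented rows: two countries in "North", one in "South"
toy = pd.DataFrame(
    {
        "Region": ["North", "North", "South"],
        "Happiness Score": [7.0, 5.0, 4.0],
        "Freedom": [0.6, 0.4, 0.5],
    },
    index=["A", "B", "C"],
)

regional_means = toy.groupby("Region").mean()
print(regional_means)  # one row per region, indexed by the region name
```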

# Plotting
Below are just some examples.
If you are interested, I suggest you look at the well-written official documentation: https://pandas.pydata.org/docs/user_guide/visualization.html

pandas Series (individual column objects) and pandas DataFrames (data tables) have a `.plot` method that allows quick and efficient plotting of data.
By default, calling `.plot()` is going to call `.plot.line()`, but there are [many other types of plots](https://pandas.pydata.org/docs/user_guide/visualization.html) supported.

In [None]:
pd.Series({
    "hi": 5,
    "how": 2,
    "are": 0,
    "you": 10
}).plot()  # same as `.plot.line()`
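
To compare the default line plot with another plot type, a minimal sketch is to draw the same Series twice, side by side. The `Agg` backend line is only there so the snippet also runs as a plain script without a display; inside a notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; an assumption for script use, not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

words = pd.Series({"hi": 5, "how": 2, "are": 0, "you": 10})

# Two axes next to each other, one per plot type
fig, (left, right) = plt.subplots(1, 2)
words.plot(ax=left, title="line (default)")  # same as words.plot.line(ax=left)
words.plot.bar(ax=right, title="bar")        # one bar per entry of the Series
```

Passing `ax=` explicitly decides where each plot is drawn, instead of leaving pandas to pick the axes.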

## plotting relationships

In [None]:
df.plot.scatter(                     # creating a scatterplot of
    x='Economy (GDP per Capita)',    # using "GDP/capita" on the horizontal axis
    y='Generosity',                  # using "Generosity" on the vertical axis 
)

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()

df.plot.scatter(                     # creating a scatterplot of
    x='Economy (GDP per Capita)',    # using "GDP/capita" on the horizontal axis
    y='Generosity',                  # using "Generosity" on the vertical axis 
    c='Health (Life Expectancy)',    # color it by "Health (Life Expectancy)"
    cmap='rainbow',                  # using a rainbow color spectrum
    grid=True,                       # plotting the grid underneath
    ax=ax,                           # use the previously defined axis (otherwise the x-axis label will not show, known bug)
) 

### Group Bar Plot

In [None]:
# grouping by region, taking the average ...
regional_means = df.groupby("Region").mean()

# ... and plotting the Happiness Score as a bar plot
ax = regional_means['Happiness Score'].plot.bar()

### saving the figure to file

In [None]:
ax.figure.savefig('myplot.jpeg', bbox_inches="tight") # saving the file... without cutting off the labels
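
The output format is inferred from the file extension, so the same call can produce a PNG instead. A small self-contained sketch, with invented data and a temporary folder as the (assumed) destination:

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # off-screen backend; an assumption for script use, not needed in a notebook
import pandas as pd

# Invented regional averages, standing in for the real group-by result
scores = pd.Series({"North": 6.0, "South": 4.0})
ax = scores.plot.bar()

out_path = Path(tempfile.mkdtemp()) / "myplot.png"  # ".png" selects the PNG format
ax.figure.savefig(out_path, bbox_inches="tight")    # "tight" trims the margins without clipping labels

print(out_path.exists())
```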

# For more..
If you find data analysis with pandas interesting or useful, I created an open tutorial with exercises along the way:
[https://github.com/gabrielecalvo/pandas_tutorial](https://github.com/gabrielecalvo/pandas_tutorial#pandas-tutorial)

# Exercise
Load the data from each year and plot the evolution of the `Happiness Score` of your favourite country over the years.

In [None]:
...

### possible solution

In [None]:
my_country = "Italy"

pd.Series({
    "2015": pd.read_csv('HappinessReport/2015.csv', index_col=0).loc[my_country, 'Happiness Score'],
    "2016": pd.read_csv('HappinessReport/2016.csv', index_col=0).loc[my_country, 'Happiness Score'],
    "2017": pd.read_csv('HappinessReport/2017.csv', index_col=0).loc[my_country, 'Happiness.Score'],
    "2018": pd.read_csv('HappinessReport/2018.csv', index_col=1).loc[my_country, 'Score'],
    "2019": pd.read_csv('HappinessReport/2019.csv', index_col=1).loc[my_country, 'Score'],
}).plot(title='Happiness Score')

In [None]:
pd.Series({
    "2015": pd.read_csv('HappinessReport/2015.csv', index_col=0).loc[my_country, 'Happiness Rank'],
    "2016": pd.read_csv('HappinessReport/2016.csv', index_col=0).loc[my_country, 'Happiness Rank'],
    "2017": pd.read_csv('HappinessReport/2017.csv', index_col=0).loc[my_country, 'Happiness.Rank'],
    "2018": pd.read_csv('HappinessReport/2018.csv', index_col=1).loc[my_country, 'Overall rank'],
    "2019": pd.read_csv('HappinessReport/2019.csv', index_col=1).loc[my_country, 'Overall rank'],
}).plot(title='Happiness Rank')
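
The repetition in the solution comes from the files not sharing the same layout: the index column and the score column name change between years. One possible way to tidy this up is to keep those differences in a small lookup table and loop over it. The sketch below uses invented in-memory CSVs (made-up numbers and simplified columns) in place of the real files, and the `layout` dictionary is a hypothetical helper, not part of the dataset.

```python
import io
import pandas as pd

# Invented stand-ins for the three layouts used in the solution; the numbers are made up
fake_files = {
    "2016": "Country,Happiness Score\nItaly,1.0\nSpain,2.0\n",
    "2017": "Country,Happiness.Score\nItaly,3.0\nSpain,4.0\n",
    "2018": "Overall rank,Country,Score\n1,Spain,6.0\n2,Italy,5.0\n",
}

# Hypothetical lookup: year -> (position of the country column, name of the score column)
layout = {
    "2016": (0, "Happiness Score"),
    "2017": (0, "Happiness.Score"),
    "2018": (1, "Score"),
}

my_country = "Italy"

scores = pd.Series(
    {
        year: pd.read_csv(io.StringIO(text), index_col=layout[year][0]).loc[
            my_country, layout[year][1]
        ]
        for year, text in fake_files.items()
    }
)
print(scores)
```

With the real data, `io.StringIO(text)` would be replaced by the path of each year's file, and `scores.plot(title='Happiness Score')` would draw the same chart as the solution.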