## Reading and exploring the data

Additional info: [How to master Python’s main data analysis library in 20 Minutes](https://towardsdatascience.com/how-to-master-pandas-8514f33f00f6)

In [None]:
import pandas as pd

1. You can load data from a local file like this:

        data = pd.read_csv('happiness_with_continent.csv')


2. Or you can read data directly from the web like this:

        data = pd.read_csv('https://.../happiness_with_continent.csv')
        

3. pandas can also read many other formats and sources, such as Excel (xlsx), csv, txt, SQL, HTML, and JSON.

4. See also [pandas-datareader](https://pandas-datareader.readthedocs.io/en/latest/index.html), a library that lets you read up-to-date data from multiple sources.
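As a small sketch of the loading step, the cell below reads CSV text from an in-memory buffer instead of a file or URL, so it runs without network access. The two rows are invented toy values, not taken from the real dataset.

```python
import io
import pandas as pd

# Toy CSV content (invented values); the real file has more rows and columns.
csv_text = """Country name,Year,Life Ladder
A,2018,7.0
B,2018,6.0
"""

# read_csv accepts a local path, a URL, or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # number of rows and columns
print(sample.dtypes)   # column types inferred by pandas
```

The same call works unchanged when the buffer is replaced by a file name or a URL.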

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/FBosler/you-datascientist/master/happiness_with_continent.csv')

In [None]:
data

### Initial data exploration

Some useful methods and properties for initial data exploration:

    data.shape      dimensions of the DataFrame as (rows, columns)
    data.index      row labels
    data.columns    column labels
    data.info()     information about the data (data types and non-null counts)
    data.describe() descriptive statistics (numeric columns only, by default)
    data.sample(n)  n random rows
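To see what these calls return without downloading anything, here is a minimal sketch on a toy DataFrame. The column names mirror the dataset, but the values are invented, and `random_state=0` is only there to make the sample repeatable.

```python
import pandas as pd

# Toy frame with invented values and one missing entry.
toy = pd.DataFrame({
    "Country name": ["A", "A", "B", "B"],
    "Year": [2017, 2018, 2017, 2018],
    "Life Ladder": [5.0, 5.5, 7.0, None],
})

n_rows, n_cols = toy.shape               # 4 rows, 3 columns
column_names = list(toy.columns)         # column labels as a plain list
non_null = toy["Life Ladder"].count()    # counts non-missing values only
summary = toy.describe()                 # statistics for the numeric columns
two_random_rows = toy.sample(2, random_state=0)

print(n_rows, n_cols, non_null)
print(summary)
```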


In [None]:
data.index

In [None]:
data.columns

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
# selecting a column
data['Year']

In [None]:
# selecting more than one column
data[['Country name','Life Ladder']]

In [None]:
# implicit (positional) index: selecting a row
data.iloc[0]

In [None]:
# implicit (positional) index: selecting a range of rows
data.iloc[0:5]

In [None]:
# implicit (positional) index: selecting rows and columns
data.iloc[0:5, [1, 2]]

In [None]:
# explicit index (labels): selecting a row or range of rows; the end label is included
data.loc[0:3]

In [None]:
# explicit index (labels): selecting rows and columns
data.loc[1:3, ["Country name", "Year", "Continent"]]
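The difference between `iloc` and `loc` is easier to see when the row labels are not 0, 1, 2, ... The sketch below uses a toy frame with invented values and labels 10 to 40: `iloc` slices by position and excludes the end, while `loc` slices by label and includes it.

```python
import pandas as pd

# Toy frame whose labels differ from the row positions.
toy = pd.DataFrame(
    {
        "Year": [2015, 2016, 2017, 2018],
        "Continent": ["Europe", "Asia", "Africa", "Europe"],
    },
    index=[10, 20, 30, 40],
)

by_position = toy.iloc[0:2]           # positions 0 and 1; the end is excluded
by_label = toy.loc[10:30]             # labels 10 through 30; the end is included
one_cell_block = toy.loc[20:30, ["Year"]]   # rows by label, columns by name

print(list(by_position.index))  # [10, 20]
print(list(by_label.index))     # [10, 20, 30]
```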

### Other relevant methods

    data.sort_values(by=col) sorts the data by a column or list of columns
    data.set_index(col)      uses column col as the index of the DataFrame;
                             returns a new DataFrame unless inplace=True is given
    data.reset_index()       moves the index back into a column and restores
                             the default integer index (also returns a new DataFrame)
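A minimal sketch of these three methods on a toy frame (invented values). It assumes the default behaviour, where each call returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country name": ["B", "A", "B", "A"],
    "Year": [2018, 2018, 2017, 2017],
})

# Sort by country first, then by year within each country.
ordered = toy.sort_values(by=["Country name", "Year"])

# Without inplace=True, set_index returns a new frame; toy is unchanged.
indexed = toy.set_index("Country name")

# reset_index turns the index back into an ordinary column.
restored = indexed.reset_index()

print(ordered)
print(indexed.index.name)       # Country name
print(list(restored.columns))   # ['Country name', 'Year']
```

Assigning the result to a new name, as done here, is often easier to follow than `inplace=True`, because the original frame stays available for comparison.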

In [None]:
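# sort by a single column
data.sort_values(by='Year')
# sort by country, then by year within each country
data.sort_values(by=['Country name','Year'])
# sort by year, then by country within each year
data.sort_values(by=['Year','Country name'])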

In [None]:
data.set_index('Country name',inplace=True)

In [None]:
data.sample(5)

In [None]:
# returns a copy with the default index; data itself keeps 'Country name' as its index
data.reset_index()

In [None]:
data.min()

In [None]:
# Select the rows that satisfy several conditions at once
life_condition = data['Life Ladder']  > 4
year_condition = data['Year']  > 2014
gen_condition = data['Generosity'] > .4

data.loc[life_condition & year_condition & gen_condition]
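The same filtering pattern can be checked by hand on a toy frame (invented values). Each comparison produces a boolean Series, and `&` combines them element by element. When the conditions are written inline, each one needs its own parentheses because `&` binds more tightly than `>`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2014, 2015, 2016, 2016],
    "Life Ladder": [5.0, 3.5, 6.0, 4.5],
    "Generosity": [0.5, 0.5, 0.1, 0.45],
})

# One boolean value per row; True only where all three conditions hold.
mask = (
    (toy["Life Ladder"] > 4)
    & (toy["Year"] > 2014)
    & (toy["Generosity"] > 0.4)
)

selected = toy.loc[mask]
print(mask.tolist())   # [False, False, False, True]
print(selected)
```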

### Analytical functions

    max, min
    sum
    mean, median, quantile
    idxmax, idxmin: return the index label of the row where the first maximum/minimum is found
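A quick sketch of these functions on a toy Series (invented values, with letters standing in for country names as the index). Note that `idxmax` returns the label of the first occurrence when the maximum appears more than once.

```python
import pandas as pd

scores = pd.Series([1.0, 4.0, 2.0, 4.0], index=["A", "B", "C", "D"])

highest = scores.max()                # largest value
lowest_label = scores.idxmin()        # label of the smallest value
first_max_label = scores.idxmax()     # 4.0 appears twice; the first label wins
middle = scores.median()              # average of the two middle values
lower_quartile = scores.quantile(0.25)  # first quartile, linear interpolation

print(highest, lowest_label, first_max_label, middle, lower_quartile)
```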

In [None]:
data.min()

In [None]:
data[["Year", "Positive affect"]].min()

In [None]:
data["Positive affect"].idxmax()

### groupby
So far, every calculation has been applied to the entire DataFrame, a row, or a column. We can also group the data and calculate metrics for each group separately.

In [None]:
data.groupby(['Country name'])['Life Ladder'].max()

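Suppose we want, for each year, the country with the highest Life Ladder. Because 'Country name' was set as the index earlier, idxmax returns the country name for each year.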

In [None]:
data.groupby(['Year'])['Life Ladder'].idxmax()
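The two groupby calls above can be traced on a toy frame (invented values), assuming, as in this notebook, that 'Country name' has been set as the index.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country name": ["A", "B", "A", "B"],
    "Year": [2017, 2017, 2018, 2018],
    "Life Ladder": [5.0, 7.0, 6.5, 6.0],
}).set_index("Country name")

# Highest Life Ladder ever recorded for each country
# (groupby also accepts the name of an index level).
best_per_country = toy.groupby("Country name")["Life Ladder"].max()

# For each year, the index label (here the country) of the highest value.
top_country_per_year = toy.groupby("Year")["Life Ladder"].idxmax()

print(best_per_country)
print(top_country_per_year)
```

If the index had not been set to the country, `idxmax` would return integer row labels instead, which could then be passed to `loc` to look up the matching rows.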