# Pandas for Everyone, chapter 01
The author's [GitHub page](https://github.com/chendaniely/pandas_for_everyone)

### Pandas data types
Pandas has its own data types, which correspond roughly to the following Python types:

Pandas Type | Python Type
------------|------------
object | string
datetime64 | datetime
int64 | int
float64 | float
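
As a quick check of the table above, here is a minimal sketch on a hypothetical toy DataFrame (the `toy` frame and its `measured` column are made up for illustration and are not part of the gapminder data). Depending on the pandas version, the text column may be reported as `object` or as a dedicated string dtype, so the sketch only inspects the numeric and datetime columns.

```python
import pandas as pd

# Hypothetical toy data, not the gapminder file used in this chapter
toy = pd.DataFrame({
    "country": ["Afghanistan", "Zimbabwe"],
    "year": [1952, 2007],
    "lifeExp": [28.801, 43.487],
    "measured": pd.to_datetime(["1952-01-01", "2007-01-01"]),
})

# One dtype per column
print(toy.dtypes)

year_dtype = str(toy["year"].dtype)        # Python int   -> int64
life_dtype = str(toy["lifeExp"].dtype)     # Python float -> float64
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["measured"])
```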

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/gapminder.txt', sep='\t')
df.dtypes # show the column types

country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

In [3]:
df.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


### Get a row by index label

In [4]:
df.loc[99] # get the row with index label 99, which is the 100th row

country      Bangladesh
continent          Asia
year               1967
lifeExp          43.453
pop            62821884
gdpPercap       721.186
Name: 99, dtype: object

### Get the last row

In [5]:
# df.loc[-1] # doesn't work: .loc looks for a row with index label -1 and raises a KeyError

df.iloc[-1] # works because .iloc selects by position, so -1 means the last row

country      Zimbabwe
continent      Africa
year             2007
lifeExp        43.487
pop          12311143
gdpPercap     469.709
Name: 1703, dtype: object
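
The commented-out `df.loc[-1]` can be tried safely on a small stand-in. This is a sketch on a hypothetical three-row `toy` frame (an assumption, not the chapter's data) with the default 0, 1, 2 index:

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0, 1, 2
toy = pd.DataFrame({
    "country": ["Afghanistan", "Bangladesh", "Zimbabwe"],
    "year": [1952, 1967, 2007],
})

# .iloc is positional, so -1 means "the last row"
last_by_position = toy.iloc[-1]

# .loc looks up the label -1, which does not exist in this index
try:
    toy.loc[-1]
    loc_failed = False
except KeyError:
    loc_failed = True

print(last_by_position["country"], loc_failed)
```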

In [6]:
df.loc[df.shape[0] - 1] # df.shape gives (#rows, #columns); with the default index, #rows - 1 is the label of the last row

country      Zimbabwe
continent      Africa
year             2007
lifeExp        43.487
pop          12311143
gdpPercap     469.709
Name: 1703, dtype: object

Or we can get the last row using tail():

In [7]:
df.tail(1)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
1703,Zimbabwe,Africa,2007,43.487,12311143,469.709298


### What's the difference?
df.tail(1) and df.loc\[df.shape[0] - 1\] print differently because they return different types: the first returns a DataFrame, the second a Series.

In [8]:
type(df.tail(1))

pandas.core.frame.DataFrame

In [9]:
type(df.loc[df.shape[0] - 1])

pandas.core.series.Series
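
If a one-row DataFrame is wanted from .loc as well, passing a list of labels instead of a single label should do it. A minimal sketch on a hypothetical `toy` frame (not the gapminder data):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "country": ["Afghanistan", "Bangladesh", "Zimbabwe"],
    "year": [1952, 1967, 2007],
})

last_label = toy.index[-1]

as_series = toy.loc[last_label]    # single label   -> Series
as_frame = toy.loc[[last_label]]   # list of labels -> DataFrame with one row

print(type(as_series).__name__, type(as_frame).__name__)
```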

## Subsetting data
We can use the .loc and .iloc attributes to subset data. The syntax is

.loc\[\[rows\], \[columns\]\]
.iloc\[\[rows\], \[columns\]\]

The difference is that .loc selects by index label (and column name), while .iloc selects by integer position.
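
With the default index 0, 1, 2, ... labels and positions coincide, which hides the difference. The sketch below uses a hypothetical `toy` frame with a deliberately non-default index (10, 20, 30) so the two accessors diverge:

```python
import pandas as pd

# Hypothetical toy data with a non-default index
toy = pd.DataFrame(
    {"country": ["Afghanistan", "Bangladesh", "Zimbabwe"],
     "year": [1952, 1967, 2007]},
    index=[10, 20, 30],
)

by_label = toy.loc[10, "country"]   # row labelled 10, column named "country"
by_position = toy.iloc[0, 0]        # first row, first column

# Same syntax with lists: rows first, then columns
subset_loc = toy.loc[[10, 30], ["country", "year"]]
subset_iloc = toy.iloc[[0, 2], [0, 1]]

print(by_label, by_position)
```

Here `toy.loc[0]` would raise a KeyError, because no row carries the label 0.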

In [10]:
df.loc[[0, 9, 99], ['country', 'year', 'lifeExp']] # get the first, 10th and 100th rows, with the 3 selected columns

Unnamed: 0,country,year,lifeExp
0,Afghanistan,1952,28.801
9,Afghanistan,1997,41.763
99,Bangladesh,1967,43.453


In [11]:
df.loc[:, ['country', 'year', 'lifeExp']] # get all the rows, but just the 3 selected columns, ':' means all

Unnamed: 0,country,year,lifeExp
0,Afghanistan,1952,28.801
1,Afghanistan,1957,30.332
2,Afghanistan,1962,31.997
3,Afghanistan,1967,34.020
4,Afghanistan,1972,36.088
...,...,...,...
1699,Zimbabwe,1987,62.351
1700,Zimbabwe,1992,60.377
1701,Zimbabwe,1997,46.809
1702,Zimbabwe,2002,39.989
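
The ':' slice also accepts a start and a stop, and the two accessors treat the stop differently: a .loc slice includes the end label, while an .iloc slice excludes the end position. A small sketch on a hypothetical five-row `toy` frame:

```python
import pandas as pd

# Hypothetical toy data with the default index 0..4
toy = pd.DataFrame({
    "country": ["Afghanistan"] * 5,
    "year": [1952, 1957, 1962, 1967, 1972],
    "lifeExp": [28.801, 30.332, 31.997, 34.020, 36.088],
})

label_slice = toy.loc[0:2, ["country", "year"]]   # labels 0, 1 and 2 (end included)
position_slice = toy.iloc[0:2, [0, 1]]            # positions 0 and 1 (end excluded)

print(len(label_slice), len(position_slice))
```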


## Grouped and aggregated calculations
For example, suppose we want the average of each statistic per year.

### Calculate average

In [12]:
type(df.groupby('year'))

pandas.core.groupby.generic.DataFrameGroupBy

In [13]:
df.groupby('year').mean(numeric_only=True) # average of every numeric column per year; numeric_only=True skips text columns, which recent pandas versions otherwise reject with an error

year,lifeExp,pop,gdpPercap
1952,49.05762,16950400.0,3725.276046
1957,51.507401,18763410.0,4299.408345
1962,53.609249,20421010.0,4725.812342
1967,55.67829,22658300.0,5483.653047
1972,57.647386,25189980.0,6770.082815
1977,59.570157,27676380.0,7313.166421
1982,61.533197,30207300.0,7518.901673
1987,63.212613,33038570.0,7900.920218
1992,64.160338,35990920.0,8158.608521
1997,65.014676,38839470.0,9090.175363


In [14]:
df.groupby('year')['lifeExp'].mean()

year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64
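
One way to read the groupby call is as split, apply, combine: split the rows by year, apply mean() to each piece, and combine the results. The sketch below spells that out by hand on a hypothetical `toy` frame and compares it with groupby; it illustrates the idea and is not how pandas implements it internally.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "year": [1952, 1952, 2007, 2007],
    "lifeExp": [30.0, 50.0, 60.0, 80.0],
})

# The one-liner
grouped = toy.groupby("year")["lifeExp"].mean()

# The same computation by hand
manual = {}
for year in toy["year"].unique():
    subset = toy.loc[toy["year"] == year, "lifeExp"]   # split
    manual[int(year)] = float(subset.mean())           # apply, then combine

print(grouped)
print(manual)
```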

In [15]:
df.groupby('year')[['lifeExp', 'gdpPercap']].mean() # choose two columns

year,lifeExp,gdpPercap
1952,49.05762,3725.276046
1957,51.507401,4299.408345
1962,53.609249,4725.812342
1967,55.67829,5483.653047
1972,57.647386,6770.082815
1977,59.570157,7313.166421
1982,61.533197,7518.901673
1987,63.212613,7900.920218
1992,64.160338,8158.608521
1997,65.014676,9090.175363


In [16]:
df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean() # group by year and continent

year,continent,lifeExp,gdpPercap
1952,Africa,39.1355,1252.572466
1952,Americas,53.27984,4079.062552
1952,Asia,46.314394,5195.484004
1952,Europe,64.4085,5661.057435
1952,Oceania,69.255,10298.08565
1957,Africa,41.266346,1385.236062
1957,Americas,55.96028,4616.043733
1957,Asia,49.318544,5787.73294
1957,Europe,66.703067,6963.012816
1957,Oceania,70.295,11598.522455


In [17]:
df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean().reset_index() # year and continent become regular columns instead of the index

Unnamed: 0,year,continent,lifeExp,gdpPercap
0,1952,Africa,39.1355,1252.572466
1,1952,Americas,53.27984,4079.062552
2,1952,Asia,46.314394,5195.484004
3,1952,Europe,64.4085,5661.057435
4,1952,Oceania,69.255,10298.08565
5,1957,Africa,41.266346,1385.236062
6,1957,Americas,55.96028,4616.043733
7,1957,Asia,49.318544,5787.73294
8,1957,Europe,66.703067,6963.012816
9,1957,Oceania,70.295,11598.522455
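
A flat result can also be requested up front by passing `as_index=False` to groupby, instead of calling reset_index() afterwards. A minimal sketch on a hypothetical `toy` frame, showing that both routes give the same columns and values:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "year": [1952, 1952, 1952, 1957],
    "continent": ["Africa", "Africa", "Asia", "Asia"],
    "lifeExp": [30.0, 40.0, 50.0, 60.0],
})

# Group keys become the index, then are moved back into columns
flat_a = toy.groupby(["year", "continent"])[["lifeExp"]].mean().reset_index()

# Group keys are kept as columns from the start
flat_b = toy.groupby(["year", "continent"], as_index=False)[["lifeExp"]].mean()

print(flat_b)
```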


### Count frequencies

In [18]:
df.groupby('continent')['country'].nunique() # number of distinct countries per continent (nunique() reduces each group to a count)

continent
Africa      52
Americas    25
Asia        33
Europe      30
Oceania      2
Name: country, dtype: int64

In [19]:
df.groupby('continent')['country'].unique() # the distinct countries per continent (unique() reduces each group to an array of values)

continent
Africa      [Algeria, Angola, Benin, Botswana, Burkina Fas...
Americas    [Argentina, Bolivia, Brazil, Canada, Chile, Co...
Asia        [Afghanistan, Bahrain, Bangladesh, Cambodia, C...
Europe      [Albania, Austria, Belgium, Bosnia and Herzego...
Oceania                              [Australia, New Zealand]
Name: country, dtype: object
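
Note that nunique() counts distinct values, not rows. In the sketch below, on a hypothetical `toy` frame, value_counts() is used to count rows per continent for comparison; it does not appear in the cells above and is brought in here only to show the contrast.

```python
import pandas as pd

# Hypothetical toy data: Afghanistan appears twice
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Africa"],
    "country": ["Afghanistan", "Afghanistan", "Bangladesh", "Zimbabwe"],
})

n_countries = toy.groupby("continent")["country"].nunique()   # distinct countries
names = toy.groupby("continent")["country"].unique()          # which countries
rows_per_continent = toy["continent"].value_counts()          # number of rows

print(n_countries["Asia"], list(names["Asia"]), rows_per_continent["Asia"])
```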

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623
