## Getting Started with pandas
pandas is a major tool in the field of data science. It contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python. pandas is often used in tandem with numerical
computing tools like `NumPy` and `SciPy`, analytical libraries like `statsmodels` and
`scikit-learn`, and data visualization libraries like `matplotlib`. pandas adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for data processing without `for` loops.
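
To make the "without `for` loops" point concrete, here is a minimal sketch comparing an explicit loop with the equivalent array-style expression. The `prices` and `quantities` values are made up for illustration and are not part of the gapminder data used later.

```python
import numpy as np
import pandas as pd

# Hypothetical prices and quantities, invented for this sketch
prices = pd.Series([2.5, 4.0, 1.25])
quantities = pd.Series([4, 1, 8])

# Loop version: one Python-level multiplication per element
totals_loop = []
for p, q in zip(prices, quantities):
    totals_loop.append(p * q)

# Array-style version: one expression over the whole Series, no explicit loop
totals_vec = prices * quantities

# NumPy functions can be applied to a Series directly
roots = np.sqrt(quantities)

print(totals_vec.tolist())
```

Both versions compute the same numbers; the second is shorter and leaves the element-wise work to pandas and NumPy.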

In [1]:
import pandas
# plain import; the conventional `pd` alias is introduced further down

In [2]:
pandas.__version__

'0.25.1'

In [7]:
pandas.read_csv(r"Backup\data\gapminder.tsv", sep='\t')  # raw string keeps the backslashes in the path literal

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


In [9]:
import pandas as pd
# Avoid `from pandas import *`: it pulls every public pandas name into the current namespace

In [10]:
pd.read_csv(r"Backup\data\gapminder.tsv", sep="\t")

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


## Series
A Series is a one-dimensional array-like object containing a sequence of values (of types similar to NumPy types) and an associated array of data labels, called its index.

In [11]:
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series displayed interactively shows the index on the
left and the values on the right. Since we did not specify an index for the data, a
default one consisting of the integers 0 through N - 1 (where N is the length of the
data) is created. You can get the array representation and index object of the Series via
its `values` and `index` attributes, respectively:

In [15]:
obj.values

array([ 4,  7, -5,  3], dtype=int64)

In [16]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
# Create a Series with a custom index
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64
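
With a custom index in place, values can be selected by label. The following is a small sketch that rebuilds `obj2` from the cell above; the variable names `single`, `subset`, and `positive` are only illustrative.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# A single label returns a scalar
single = obj2['a']

# A list of labels returns a Series in the requested order
subset = obj2[['c', 'a', 'd']]

# Boolean filtering keeps each value attached to its label
positive = obj2[obj2 > 0]

print(subset)
print(positive)
```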

In [17]:
df = pd.read_csv(r"Backup\data\gapminder.tsv", sep="\t")

In [18]:
type(df)

pandas.core.frame.DataFrame

In [20]:
df.shape

(1704, 6)

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country      1704 non-null object
continent    1704 non-null object
year         1704 non-null int64
lifeExp      1704 non-null float64
pop          1704 non-null int64
gdpPercap    1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB


In [24]:
# shape is an attribute, not a method, so it takes no parentheses
df.shape

(1704, 6)


In [27]:
# First 5 rows
df.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


In [28]:
df.tail()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.44996
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623
1703,Zimbabwe,Africa,2007,43.487,12311143,469.709298


In [29]:
df.columns

Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')

In [30]:
df.index

RangeIndex(start=0, stop=1704, step=1)

In [32]:
df.values

array([['Afghanistan', 'Asia', 1952, 28.801, 8425333, 779.4453145],
       ['Afghanistan', 'Asia', 1957, 30.331999999999997, 9240934,
        820.8530296],
       ['Afghanistan', 'Asia', 1962, 31.997, 10267083, 853.1007099999999],
       ...,
       ['Zimbabwe', 'Africa', 1997, 46.809, 11404948, 792.4499602999999],
       ['Zimbabwe', 'Africa', 2002, 39.989000000000004, 11926563,
        672.0386227000001],
       ['Zimbabwe', 'Africa', 2007, 43.486999999999995, 12311143,
        469.70929810000007]], dtype=object)

In [33]:
df = pd.read_csv(r"Backup\data\gapminder.tsv", sep="\t", na_values=[99])
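
The effect of `na_values` is easier to see on a tiny input. This sketch reads a made-up tab-separated string (the `city` and `temp` columns are hypothetical, not from gapminder) in which 99 stands for a missing measurement.

```python
import io
import pandas as pd

# Hypothetical tab-separated data where 99 marks a missing measurement
raw = "city\ttemp\nA\t21\nB\t99\nC\t18\n"

# Without na_values, 99 is read as an ordinary number
plain = pd.read_csv(io.StringIO(raw), sep="\t")

# With na_values=[99], every 99 is turned into NaN while parsing
cleaned = pd.read_csv(io.StringIO(raw), sep="\t", na_values=[99])

n_missing = cleaned['temp'].isna().sum()
print(cleaned)
```

Note that the replacement applies to every column, so a sentinel like 99 should only be used when it cannot occur as a real value.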


## Looking at columns in our data

In [34]:
df

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


In [38]:
df.dtypes

country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

In [39]:
df.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


In [40]:
country = df['country']
country

0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object

In [41]:
type(country)

pandas.core.series.Series

In [45]:
country = df[['country']]
country

Unnamed: 0,country
0,Afghanistan
1,Afghanistan
2,Afghanistan
3,Afghanistan
4,Afghanistan
...,...
1699,Zimbabwe
1700,Zimbabwe
1701,Zimbabwe
1702,Zimbabwe


In [46]:
df[['country', 'pop']]

Unnamed: 0,country,pop
0,Afghanistan,8425333
1,Afghanistan,9240934
2,Afghanistan,10267083
3,Afghanistan,11537966
4,Afghanistan,13079460
...,...,...
1699,Zimbabwe,9216418
1700,Zimbabwe,10704340
1701,Zimbabwe,11404948
1702,Zimbabwe,11926563


## Looking at rows in our data

In [24]:
df.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


In [50]:
df.loc[0]

country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap        779.445
Name: 0, dtype: object

In [51]:
df.loc[1]

country      Afghanistan
continent           Asia
year                1957
lifeExp           30.332
pop              9240934
gdpPercap        820.853
Name: 1, dtype: object

In [52]:
# loc selects by index label, not by position; -1 is not a label in this index
df.loc[-1]

KeyError: -1

In [28]:
df.iloc[0]

country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap        779.445
Name: 0, dtype: object

In [29]:
df.iloc[-1]

country      Zimbabwe
continent      Africa
year             2007
lifeExp        43.487
pop          12311143
gdpPercap     469.709
Name: 1703, dtype: object
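
Because gapminder has the default 0 to N - 1 index, labels and positions happen to coincide, which hides the difference between `loc` and `iloc`. The sketch below uses a hypothetical `demo` frame whose index labels (100, 200, 300) differ from the row positions.

```python
import pandas as pd

# Hypothetical frame whose index labels are not 0, 1, 2
demo = pd.DataFrame({'value': [10, 20, 30]}, index=[100, 200, 300])

by_label = demo.loc[200, 'value']   # row with label 200
by_position = demo.iloc[-1, 0]      # last row, first column

# -1 is a valid position but not a label, so loc raises KeyError
try:
    demo.loc[-1]
    raised = False
except KeyError:
    raised = True

# Label slices include both endpoints; position slices exclude the stop
label_slice = demo.loc[100:200]
pos_slice = demo.iloc[0:1]

print(by_label, by_position, raised, len(label_slice), len(pos_slice))
```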

In [73]:
df.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


In [74]:
df.loc[[0, 1, 10]]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
10,Afghanistan,Asia,2002,42.129,25268405,726.734055


## Subsetting rows and columns

In [76]:
df.loc[:, :]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


In [77]:
df.loc[:, ['year', 'pop']]

Unnamed: 0,year,pop
0,1952,8425333
1,1957,9240934
2,1962,10267083
3,1967,11537966
4,1972,13079460
...,...,...
1699,1987,9216418
1700,1992,10704340
1701,1997,11404948
1702,2002,11926563


In [78]:
df.loc[[0, 10, 100], ['year', 'pop']]

Unnamed: 0,year,pop
0,1952,8425333
10,2002,25268405
100,1972,70759295


In [81]:
df['country'] == "Zimbabwe"

0       False
1       False
2       False
3       False
4       False
        ...  
1699     True
1700     True
1701     True
1702     True
1703     True
Name: country, Length: 1704, dtype: bool

In [66]:
df.loc[df['country'] == "Zimbabwe"]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
1692,Zimbabwe,Africa,1952,48.451,3080907,406.884115
1693,Zimbabwe,Africa,1957,50.469,3646340,518.764268
1694,Zimbabwe,Africa,1962,52.358,4277736,527.272182
1695,Zimbabwe,Africa,1967,53.995,4995432,569.795071
1696,Zimbabwe,Africa,1972,55.635,5861135,799.362176
1697,Zimbabwe,Africa,1977,57.674,6642107,685.587682
1698,Zimbabwe,Africa,1982,60.363,7636524,788.855041
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.44996


In [67]:
df.loc[(df['country'] == "Zimbabwe") & (df['year'] > 1990) & (df['lifeExp'] < 40)]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


In [87]:
df.loc[(df['country'] == "Zimbabwe") & (df['year'] > 1990), ['pop', 'continent']]

Unnamed: 0,pop,continent
1700,10704340,Africa
1701,11404948,Africa
1702,11926563,Africa
1703,12311143,Africa


In [88]:
df.loc[(df.country == "Zimbabwe") & (df.year > 1990), ['pop', 'continent']]

Unnamed: 0,pop,continent
1700,10704340,Africa
1701,11404948,Africa
1702,11926563,Africa
1703,12311143,Africa
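
The filters above combine boolean Series with `&`. Each comparison needs its own parentheses because `&` binds more tightly than `==` or `>` in Python. A minimal sketch on a hypothetical four-row `demo` frame, also showing `|` (or) and `~` (not):

```python
import pandas as pd

# Hypothetical data, small enough to check by eye
demo = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'year': [1987, 1992, 1987, 1992],
    'pop': [1, 2, 3, 4],
})

# Naming the masks keeps the final selection readable
is_a = demo['country'] == 'A'
recent = demo['year'] > 1990

both = demo.loc[is_a & recent, ['pop']]        # country A and after 1990
either = demo.loc[is_a | recent, ['pop']]      # country A or after 1990
neither = demo.loc[~(is_a | recent), ['pop']]  # everything else

print(both)
print(either)
print(neither)
```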


In [89]:
df.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


In [73]:
df.groupby(['year'])['lifeExp'].sum()

year
1952    6966.18200
1957    7314.05096
1962    7612.51336
1967    7906.31712
1972    8185.92888
1977    8458.96236
1982    8737.71400
1987    8976.19100
1992    9110.76800
1997    9232.08400
2002    9328.67900
2007    9515.05400
Name: lifeExp, dtype: float64

In [76]:
import numpy as np

In [78]:
def mean_minus(arr):
    # mean of the group's values, shifted down by 100
    return np.mean(arr) - 100

In [79]:
df.groupby(['year'])['lifeExp'].agg(mean_minus)

year
1952   -50.942380
1957   -48.492599
1962   -46.390751
1967   -44.321710
1972   -42.352614
1977   -40.429843
1982   -38.466803
1987   -36.787387
1992   -35.839662
1997   -34.985324
2002   -34.305077
2007   -32.992577
Name: lifeExp, dtype: float64

In [80]:
df.groupby(['year'])['pop'].agg(np.mean)

year
1952    1.695040e+07
1957    1.876341e+07
1962    2.042101e+07
1967    2.265830e+07
1972    2.518998e+07
1977    2.767638e+07
1982    3.020730e+07
1987    3.303857e+07
1992    3.599092e+07
1997    3.883947e+07
2002    4.145759e+07
2007    4.402122e+07
Name: pop, dtype: float64

In [81]:
df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap', 'pop']].agg([np.mean, np.std])

Unnamed: 0_level_0,Unnamed: 1_level_0,lifeExp,lifeExp,gdpPercap,gdpPercap,pop,pop
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,std,mean,std,mean,std
year,continent,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
1952,Africa,39.1355,5.151581,1252.572466,982.952116,4570010.0,6317450.0
1952,Americas,53.27984,9.326082,4079.062552,3001.727522,13806100.0,32341630.0
1952,Asia,46.314394,9.291751,5195.484004,18634.890865,42283560.0,113226700.0
1952,Europe,64.4085,6.361088,5661.057435,3114.060493,13937360.0,17247450.0
1952,Oceania,69.255,0.190919,10298.08565,365.560078,5343003.0,4735083.0
1957,Africa,41.266346,5.620123,1385.236062,1134.508918,5093033.0,7076042.0
1957,Americas,55.96028,9.033192,4616.043733,3312.381083,15478160.0,35537060.0
1957,Asia,49.318544,9.635429,5787.73294,19506.515959,47356990.0,128096100.0
1957,Europe,66.703067,5.295805,6963.012816,3677.950146,14596350.0,17832350.0
1957,Oceania,70.295,0.049497,11598.522455,917.644806,5970988.0,5291395.0
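
Aggregations can also be named by string, which avoids passing NumPy functions around. This is a sketch on a hypothetical `demo` frame; the lambda plays the same role as `mean_minus` above.

```python
import pandas as pd

# Hypothetical data: two observations per year
demo = pd.DataFrame({
    'year': [1952, 1952, 1957, 1957],
    'lifeExp': [40.0, 60.0, 50.0, 70.0],
})

# A list of aggregation names gives one output column per name
stats = demo.groupby('year')['lifeExp'].agg(['mean', 'std'])

# A custom function receives each group's values as a Series
custom = demo.groupby('year')['lifeExp'].agg(lambda s: s.mean() - 100)

print(stats)
print(custom)
```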


In [85]:
import numpy as np
grouped = df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap', 'pop']].agg(np.mean)
type(grouped)


pandas.core.frame.DataFrame

In [34]:
grouped_reset = grouped.reset_index()

In [114]:
grouped_reset

Unnamed: 0,year,continent,lifeExp,gdpPercap,pop
0,1952,Africa,39.1355,1252.572466,4570010.0
1,1952,Americas,53.27984,4079.062552,13806100.0
2,1952,Asia,46.314394,5195.484004,42283560.0
3,1952,Europe,64.4085,5661.057435,13937360.0
4,1952,Oceania,69.255,10298.08565,5343003.0
5,1957,Africa,41.266346,1385.236062,5093033.0
6,1957,Americas,55.96028,4616.043733,15478160.0
7,1957,Asia,49.318544,5787.73294,47356990.0
8,1957,Europe,66.703067,6963.012816,14596350.0
9,1957,Oceania,70.295,11598.522455,5970988.0
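
Grouping by several keys puts those keys into a hierarchical index, and `reset_index` moves them back into ordinary columns. The sketch below shows this on a hypothetical three-row `demo` frame, along with `as_index=False`, which should give the flat layout directly.

```python
import pandas as pd

# Hypothetical data reusing the tutorial's column names
demo = pd.DataFrame({
    'year': [1952, 1952, 1957],
    'continent': ['Asia', 'Africa', 'Asia'],
    'lifeExp': [40.0, 60.0, 50.0],
})

# The group keys become a two-level index
grouped = demo.groupby(['year', 'continent'])[['lifeExp']].mean()

# reset_index turns the index levels back into columns
flat = grouped.reset_index()

# as_index=False keeps the keys as columns from the start
flat_direct = demo.groupby(['year', 'continent'], as_index=False)[['lifeExp']].mean()

print(grouped)
print(flat)
```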
