# Pandas!
Today I will learn how to use the pandas library to analyse data. Let's go!

Pandas is built around two data structures: the Series and the DataFrame.

A Series is a single labelled column, and a DataFrame is a two-dimensional table made up of a collection of Series.
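
To make that relationship concrete, here is a small sketch showing that picking one column out of a DataFrame gives back a Series, and that the column shares the table's row index. The names and values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
table = pd.DataFrame({
    "name": ["Ada", "Bob"],
    "age": [36, 41],
})

column = table["age"]          # selecting one column returns a Series

print(type(table).__name__)    # DataFrame
print(type(column).__name__)   # Series
print(column.index.equals(table.index))  # True: the column shares the table's row index
```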

In [194]:
import pandas as pd
import numpy as np

### Creating a Series with the default index

In [195]:
nums = [1,2,3,4,5]
s = pd.Series(nums)
print(s)

0    1
1    2
2    3
3    4
4    5
dtype: int64


### Creating a Series with a custom index

In [196]:
fruits = ["orange", "papaya", "banana", "apple", "grape"]
fruits = pd.Series(fruits, index = [1,2,3,4,5])
print(fruits)

1    orange
2    papaya
3    banana
4     apple
5     grape
dtype: object
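
With a custom index it helps to be explicit about whether a lookup goes by label or by position. A minimal sketch, reusing the fruit list from above (the variable name `fruit_series` is my own): `.loc` looks up by index label, `.iloc` by position.

```python
import pandas as pd

fruit_series = pd.Series(
    ["orange", "papaya", "banana", "apple", "grape"],
    index=[1, 2, 3, 4, 5],
)

print(fruit_series.loc[1])    # label 1        -> orange
print(fruit_series.iloc[0])   # first position -> orange
print(fruit_series.loc[5])    # label 5        -> grape
print(fruit_series.iloc[-1])  # last position  -> grape
```

Because the labels here start at 1, label 1 and position 0 point at the same element.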


### Creating a Series from a dictionary

In [197]:
dct = {'name': 'Pola',
        'country': 'Poland',
        'Age': 21}

In [198]:
s = pd.Series(dct)
print(s)

name         Pola
country    Poland
Age            21
dtype: object


### Creating a constant series

In [199]:
s = pd.Series(21, index=[1,2,3,4,5,6,7,8,9,10])
print(s)

1     21
2     21
3     21
4     21
5     21
6     21
7     21
8     21
9     21
10    21
dtype: int64


### Creating a series using linspace and arange

In [200]:
s = pd.Series(np.linspace(1,21,10))
print(s)

d = pd.Series(np.arange(1,22,1))
print(d)

0     1.000000
1     3.222222
2     5.444444
3     7.666667
4     9.888889
5    12.111111
6    14.333333
7    16.555556
8    18.777778
9    21.000000
dtype: float64
0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    11
11    12
12    13
13    14
14    15
15    16
16    17
17    18
18    19
19    20
20    21
dtype: int64
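
The two NumPy helpers differ in what the third argument means: `np.linspace(start, stop, num)` takes the number of points and includes `stop`, while `np.arange(start, stop, step)` takes the step size and leaves `stop` out. That is why the cell above passes 22 to end at 21. A small sketch with fewer points so the difference is easy to see:

```python
import numpy as np
import pandas as pd

evenly_spaced = pd.Series(np.linspace(1, 21, 5))  # 5 points, endpoint 21 included
stepped = pd.Series(np.arange(1, 22, 5))          # step of 5, stop value 22 excluded

print(evenly_spaced.tolist())  # [1.0, 6.0, 11.0, 16.0, 21.0]
print(stepped.tolist())        # [1, 6, 11, 16, 21]
print(evenly_spaced.dtype, stepped.dtype)  # linspace yields floats, arange keeps integers here
```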


# DataFrames

### Using a list of lists

In [201]:
data = [
    ['Pola', 'Poland', 'Łódź'],
    ['Juki', 'Poland', 'KittyCity'],
    ['Lala', 'Poland', 'Cattown']
]

df = pd.DataFrame(data, columns = ['Name', 'Country', 'City'])
df

Unnamed: 0,Name,Country,City
0,Pola,Poland,Łódź
1,Juki,Poland,KittyCity
2,Lala,Poland,Cattown


### Using a dictionary

In [202]:
data = {
    'Name' : ['Pola', 'Juki', 'Lala'],
    'Country' : ['Poland', 'Poland','Poland'],
    'City' : ['Łódź', 'KittyCity', 'Cattown']
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Country,City
0,Pola,Poland,Łódź
1,Juki,Poland,KittyCity
2,Lala,Poland,Cattown


# Reading CSV files using Pandas

In [203]:
df = pd.read_csv('weight-height.csv')
df

Unnamed: 0,Gender,Height,Weight
0,Male,73.847017,241.893563
1,Male,68.781904,162.310473
2,Male,74.110105,212.740856
3,Male,71.730978,220.042470
4,Male,69.881796,206.349801
...,...,...,...
9995,Female,66.172652,136.777454
9996,Female,67.067155,170.867906
9997,Female,63.867992,128.475319
9998,Female,69.034243,163.852461
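
To try `read_csv` without the file on disk, it can also read from an in-memory text buffer. The sketch below uses a few made-up rows in the same layout as `weight-height.csv` (they are not taken from the real file) and shows two optional parameters, `usecols` and `nrows`, for loading only part of a file.

```python
import io

import pandas as pd

# Made-up rows in the same layout as weight-height.csv
csv_text = """Gender,Height,Weight
Male,70.0,180.0
Female,64.5,130.0
Female,66.0,140.5
"""

sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)            # (3, 3)
print(sample["Height"].dtype)  # float64

# Read only some columns and only the first two rows
partial = pd.read_csv(io.StringIO(csv_text), usecols=["Gender", "Height"], nrows=2)
print(partial.shape)           # (2, 2)
```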


In [204]:
df.head()

Unnamed: 0,Gender,Height,Weight
0,Male,73.847017,241.893563
1,Male,68.781904,162.310473
2,Male,74.110105,212.740856
3,Male,71.730978,220.04247
4,Male,69.881796,206.349801


In [205]:
df.head(10)

Unnamed: 0,Gender,Height,Weight
0,Male,73.847017,241.893563
1,Male,68.781904,162.310473
2,Male,74.110105,212.740856
3,Male,71.730978,220.04247
4,Male,69.881796,206.349801
5,Male,67.253016,152.212156
6,Male,68.785081,183.927889
7,Male,68.348516,167.97111
8,Male,67.01895,175.92944
9,Male,63.456494,156.399676


In [206]:
df.tail()

Unnamed: 0,Gender,Height,Weight
9995,Female,66.172652,136.777454
9996,Female,67.067155,170.867906
9997,Female,63.867992,128.475319
9998,Female,69.034243,163.852461
9999,Female,61.944246,113.649103


The shape attribute tells us how many rows and columns we have, which is useful when we can't display the entire dataset.

In [207]:
print(df.shape)

(10000, 3)


In [208]:
print(df.columns)

Index(['Gender', 'Height', 'Weight'], dtype='object')


In [209]:
heights = df['Height']
heights

0       73.847017
1       68.781904
2       74.110105
3       71.730978
4       69.881796
          ...    
9995    66.172652
9996    67.067155
9997    63.867992
9998    69.034243
9999    61.944246
Name: Height, Length: 10000, dtype: float64

We can use describe() to get a statistical description of our data.

In [210]:
heights.describe()

count    10000.000000
mean        66.367560
std          3.847528
min         54.263133
25%         63.505620
50%         66.318070
75%         69.174262
max         78.998742
Name: Height, dtype: float64
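
Each row of that summary can also be computed on its own, which is handy when only one number is needed. A minimal sketch on made-up values (not the `weight-height.csv` data):

```python
import pandas as pd

sample = pd.Series([60.0, 62.0, 64.0, 66.0, 68.0])  # hypothetical heights

summary = sample.describe()

print(sample.count())         # 5
print(sample.mean())          # 64.0
print(sample.median())        # 64.0, the "50%" row of describe()
print(sample.quantile(0.25))  # 62.0, the "25%" row of describe()
print(summary["max"])         # 68.0
print(sample.std())           # sample standard deviation, the "std" row
```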

In [211]:
df.describe()

Unnamed: 0,Height,Weight
count,10000.0,10000.0
mean,66.36756,161.440357
std,3.847528,32.108439
min,54.263133,64.700127
25%,63.50562,135.818051
50%,66.31807,161.212928
75%,69.174262,187.169525
max,78.998742,269.989699


In [212]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Gender  10000 non-null  object 
 1   Height  10000 non-null  float64
 2   Weight  10000 non-null  float64
dtypes: float64(2), object(1)
memory usage: 234.5+ KB


# Modifying DataFrames


In [213]:
data = {
    'Name' : ['Pola', 'Juki', 'Lala'],
    'Country' : ['Poland', 'Poland','Poland'],
    'City' : ['Łódź', 'KittyCity', 'Cattown']
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Country,City
0,Pola,Poland,Łódź
1,Juki,Poland,KittyCity
2,Lala,Poland,Cattown


### Let's add a new column

In [214]:
weights = [58,3,2]
df['Weight'] = weights
df

Unnamed: 0,Name,Country,City,Weight
0,Pola,Poland,Łódź,58
1,Juki,Poland,KittyCity,3
2,Lala,Poland,Cattown,2


In [215]:
heights = [160, 25, 20]
df['Height'] = heights
df

Unnamed: 0,Name,Country,City,Weight,Height
0,Pola,Poland,Łódź,58,160
1,Juki,Poland,KittyCity,3,25
2,Lala,Poland,Cattown,2,20


Now let's calculate a new column called BMI! The heights are in centimeters and the weights in kilograms, so we first convert Height to meters and then apply the formula Weight / (Height * Height). I wonder what the BMI value will be for my cats.

In [216]:
df['Height'] = df['Height'] * 0.01
df


Unnamed: 0,Name,Country,City,Weight,Height
0,Pola,Poland,Łódź,58,1.6
1,Juki,Poland,KittyCity,3,0.25
2,Lala,Poland,Cattown,2,0.2


In [217]:
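def calculate_bmi(weight_kg, height_m):
    # BMI = weight in kilograms divided by the square of the height in meters
    return weight_kg / height_m ** 2


# Pass the columns in explicitly instead of reading the global df inside the function
df['BMI'] = calculate_bmi(df['Weight'], df['Height']).round(1)
df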

Unnamed: 0,Name,Country,City,Weight,Height,BMI
0,Pola,Poland,Łódź,58,1.6,22.7
1,Juki,Poland,KittyCity,3,0.25,48.0
2,Lala,Poland,Cattown,2,0.2,50.0


My cats are morbidly obese <3 

... Well, of course not. They are cats, so BMI doesn't really apply to them, BUT my BMI looks great!

As per the tutorial's orders, let's add more columns.

In [218]:
birth_year = [2003,2024,2024]
current_year = pd.Series(2025, index = [0,1,2])
df['Birth year'] = birth_year
df['Current year'] = current_year
df

Unnamed: 0,Name,Country,City,Weight,Height,BMI,Birth year,Current year
0,Pola,Poland,Łódź,58,1.6,22.7,2003,2025
1,Juki,Poland,KittyCity,3,0.25,48.0,2024,2025
2,Lala,Poland,Cattown,2,0.2,50.0,2024,2025


In [219]:
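def calculate_age(birth_year, current_year):
    # Age in whole years, ignoring the month and day of birth
    return current_year - birth_year


df['Age'] = calculate_age(df['Birth year'], df['Current year'])
df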

Unnamed: 0,Name,Country,City,Weight,Height,BMI,Birth year,Current year,Age
0,Pola,Poland,Łódź,58,1.6,22.7,2003,2025,22
1,Juki,Poland,KittyCity,3,0.25,48.0,2024,2025,1
2,Lala,Poland,Cattown,2,0.2,50.0,2024,2025,1


### Checking and changing types

In [220]:
print(df.Weight.dtype)

int64


In [221]:
print(df.BMI.dtype)

float64


Let's change it to a string for the purposes of the exercise.

In [223]:
df['BMI'] = df['BMI'].astype('string')
print(df['BMI'].dtype)

string
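
One thing to keep in mind: once a column holds strings, it no longer behaves like numbers. A small sketch on made-up values showing how to convert back with `astype(float)`. `pd.to_numeric` with `errors="coerce"` is an alternative when some values might not parse.

```python
import pandas as pd

bmi_text = pd.Series([22.7, 48.0, 50.0]).astype("string")
print(bmi_text.dtype)

bmi_number = bmi_text.astype(float)  # back to a numeric column
print(bmi_number.dtype)              # float64
print(bmi_number.max())              # 50.0

# pd.to_numeric can turn unparseable entries into NaN instead of raising an error
mixed = pd.Series(["22.7", "n/a"])
print(pd.to_numeric(mixed, errors="coerce").isna().tolist())  # [False, True]
```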


### Boolean indexing

In [226]:
df[df['Age'] < 20]

Unnamed: 0,Name,Country,City,Weight,Height,BMI,Birth year,Current year,Age
1,Juki,Poland,KittyCity,3,0.25,48.0,2024,2025,1
2,Lala,Poland,Cattown,2,0.2,50.0,2024,2025,1


In [227]:
df[df['Age'] > 20]

Unnamed: 0,Name,Country,City,Weight,Height,BMI,Birth year,Current year,Age
0,Pola,Poland,Łódź,58,1.6,22.7,2003,2025,22
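
Conditions can also be combined. A minimal sketch on a cut-down copy of the table above (only three columns kept, and the variable names are my own). Note that pandas uses `&`, `|` and `~` rather than `and`, `or` and `not`, and each condition needs its own parentheses.

```python
import pandas as pd

pets_and_me = pd.DataFrame({
    "Name": ["Pola", "Juki", "Lala"],
    "Weight": [58, 3, 2],
    "Age": [22, 1, 1],
})

# AND: both conditions must hold
young_and_light = pets_and_me[(pets_and_me["Age"] < 20) & (pets_and_me["Weight"] < 3)]
print(young_and_light["Name"].tolist())  # ['Lala']

# OR: at least one condition must hold
adult_or_tiny = pets_and_me[(pets_and_me["Age"] > 20) | (pets_and_me["Weight"] < 3)]
print(adult_or_tiny["Name"].tolist())    # ['Pola', 'Lala']

# NOT: invert a mask with ~
not_young = pets_and_me[~(pets_and_me["Age"] < 20)]
print(not_young["Name"].tolist())        # ['Pola']
```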


## Time for some exercises of my own!

In [229]:
df = pd.read_csv('hacker_news.csv')
df

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
2,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
3,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12
4,10482257,Title II kills investment? Comcast and other I...,http://arstechnica.com/business/2015/10/comcas...,53,22,Deinos,10/31/2015 9:48
...,...,...,...,...,...,...,...
20094,12379592,How Purism Avoids Intels Active Management Tec...,https://puri.sm/philosophy/how-purism-avoids-i...,10,6,AdmiralAsshat,8/29/2016 2:22
20095,10339284,YC Application Translated and Broken Down,https://medium.com/@zreitano/the-yc-applicatio...,4,1,zreitano,10/6/2015 14:57
20096,10824382,Microkernels are slow and Elvis didn't do no d...,http://blog.darknedgy.net/technology/2016/01/0...,169,132,vezzy-fnord,1/2/2016 0:49
20097,10739875,How Product Hunt really works,https://medium.com/@benjiwheeler/how-product-h...,695,222,brw12,12/15/2015 19:32


In [231]:
df.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
2,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
3,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12
4,10482257,Title II kills investment? Comcast and other I...,http://arstechnica.com/business/2015/10/comcas...,53,22,Deinos,10/31/2015 9:48


In [232]:
df.tail()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
20094,12379592,How Purism Avoids Intels Active Management Tec...,https://puri.sm/philosophy/how-purism-avoids-i...,10,6,AdmiralAsshat,8/29/2016 2:22
20095,10339284,YC Application Translated and Broken Down,https://medium.com/@zreitano/the-yc-applicatio...,4,1,zreitano,10/6/2015 14:57
20096,10824382,Microkernels are slow and Elvis didn't do no d...,http://blog.darknedgy.net/technology/2016/01/0...,169,132,vezzy-fnord,1/2/2016 0:49
20097,10739875,How Product Hunt really works,https://medium.com/@benjiwheeler/how-product-h...,695,222,brw12,12/15/2015 19:32
20098,11680777,RoboBrowser: Your friendly neighborhood web sc...,https://github.com/jmcarp/robobrowser,182,58,pmoriarty,5/12/2016 1:43


In [234]:
title_series = df['title']  # selecting a single column already returns a Series
title_series

0                                Interactive Dynamic Video
1        Florida DJs May Face Felony for April Fools' W...
2             Technology ventures: From Idea to Enterprise
3        Note by Note: The Making of Steinway L1037 (2007)
4        Title II kills investment? Comcast and other I...
                               ...                        
20094    How Purism Avoids Intels Active Management Tec...
20095            YC Application Translated and Broken Down
20096    Microkernels are slow and Elvis didn't do no d...
20097                        How Product Hunt really works
20098    RoboBrowser: Your friendly neighborhood web sc...
Name: title, Length: 20099, dtype: object

In [236]:
df.shape

(20099, 7)

In [238]:
df[df['title'].str.contains('python')]

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
765,12117140,IronPython 3 (python for .net) development res...,https://www.reddit.com/r/Python/comments/4tbhw...,3,1,tanlermin,7/18/2016 18:34
6129,10512775,Learn data science in python by taking online ...,http://www.dezyre.com/data-science-in-python/36,1,1,shankar251289,11/5/2015 11:34
6711,11328646,"Show HN: Dplython, dplyr data manipulation for...",https://github.com/dodger487/dplython,11,1,capybara,3/21/2016 15:18
7979,10993953,Kickstarter for funding Micropython port to ES...,https://www.kickstarter.com/projects/214379695...,98,41,Sfabris,1/29/2016 7:16
13218,12095313,DQN for Beginners in 200 lines of python code ...,https://yanpanlau.github.io/2016/07/10/FlappyB...,5,1,yanpanlau,7/14/2016 16:40
13575,11269629,Show HN: Venv2docker create a docker image fr...,https://github.com/Markbnj/venv2docker,8,2,markbnj,3/11/2016 21:20
14004,12200724,EasyMake one python file instead tons of Make...,https://github.com/l4l/EasyMake,1,3,kitsu,8/1/2016 7:29
15784,11008512,let in python (2014),https://nvbn.github.io/2014/09/25/let-statemen...,1,2,hatmatrix,1/31/2016 22:43
16026,12238517,Ptpython better than ipython or bpython?,http://terriblecode.com/why-ptpython-is-the-on...,1,1,ausjke,8/6/2016 15:55
18421,12027874,Show HN: Stack overflow command line client ad...,https://github.com/gautamkrishnar/socli,1,1,gautamkrishnar,7/3/2016 22:12
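
`str.contains` is case-sensitive by default, so the search above finds 'python' but skips titles that only say 'Python'. A small sketch with made-up titles showing `case=False`, plus `na=False` so that missing titles count as non-matches instead of breaking the filter:

```python
import pandas as pd

# Hypothetical titles, including one missing value
titles = pd.Series(["Learn python fast", "Python 3 released", "Rust news", None])

exact_case = titles.str.contains("python", na=False)
any_case = titles.str.contains("python", case=False, na=False)

print(exact_case.tolist())        # [True, False, False, False]
print(any_case.tolist())          # [True, True, False, False]
print(titles[any_case].tolist())  # ['Learn python fast', 'Python 3 released']
```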


In [239]:
df[df['title'].str.contains('JavaScript')]

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
267,12352636,Show HN: Hire JavaScript - Top JavaScript Talent,https://www.hirejs.com/,1,1,eibrahim,8/24/2016 15:16
811,10741251,Ask HN: Are there any projects or compilers wh...,,1,2,ggonweb,12/15/2015 23:26
1046,11343334,"If you write JavaScript tools or libraries, bu...",https://medium.com/@Rich_Harris/how-to-not-bre...,48,19,callumlocke,3/23/2016 10:54
1093,10422726,Rollup.js: A next-generation JavaScript module...,http://rollupjs.org,57,17,dmmalam,10/21/2015 0:02
1162,12461624,V8 JavaScript Engine: V8 Release 5.4,http://v8project.blogspot.com/2016/09/v8-relea...,126,19,okket,9/9/2016 12:46
...,...,...,...,...,...,...,...
19349,11448301,"Fotorama, a responsive JavaScript photo gallery",http://fotorama.io/,1,1,alexkon,4/7/2016 15:59
19548,12105148,Another Kind of JavaScript Fatigue,http://chrismm.com/blog/the-other-kind-of-java...,9,2,JacksCracked,7/16/2016 3:44
19610,12203508,Lonely programmer detective uncovers the Mozil...,http://stackoverflow.com/a/38677222/984780,29,8,luisperezphd,8/1/2016 16:07
19885,12552131,Ask HN: Best Practices for CSS in a Modern Jav...,,6,6,xwvvvvwx,9/21/2016 20:53
