# EDA Lab


## Pandas

* *pandas* is a Python library for data analysis. 
* It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python. 

* *pandas* builds on *numpy*, providing easy-to-use data structures and data manipulation functions with integrated indexing.

* The main data structures *pandas* provides are *Series* and *DataFrames*. 

Let's get started 

### Import Pandas and Numpy Libraries

In [1]:
import numpy as np
import pandas as pd


### Introduction to pandas Data Structures

* *pandas* has two main data structures: *Series* and *DataFrames*. 

### pandas Series
- A *Series* is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). 
- The axis labels are collectively referred to as the index. 

### Creating a Series by passing a list of values, letting pandas create a default integer index

In [6]:
data = np.array([1, 3, 5, np.nan, 6, 8])
s = pd.Series(data)
s


0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

In [7]:
type(s)

pandas.core.series.Series

In [8]:
s.index

RangeIndex(start=0, stop=6, step=1)

In [9]:
print(s[:2])

0    1.0
1    3.0
dtype: float64


### Creating a series with 'object' data type index

In [10]:
ser = pd.Series([100, 'foo', 300, 'bar', 500], index=['tom', 'bob', 'nancy', 'dan', 'eric'])

In [11]:
print(ser)

tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object


In [12]:
ser.index

Index(['tom', 'bob', 'nancy', 'dan', 'eric'], dtype='object')

In [13]:
ser.loc[['nancy','bob']]                #label-location based indexer for selection by label

nancy    300
bob      foo
dtype: object

In [14]:
ser.iloc[[4, 3, 1]]                     #select by integer position; plain ser[[4, 3, 1]] is deprecated on a label index

eric    500
dan     bar
bob     foo
dtype: object
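
The two indexers can address the same rows, one by label and one by position. The following is a minimal sketch that rebuilds the Series from this section; the names `by_label`, `by_position` and `same` are introduced here only for illustration.

```python
import pandas as pd

ser = pd.Series([100, 'foo', 300, 'bar', 500],
                index=['tom', 'bob', 'nancy', 'dan', 'eric'])

by_label = ser.loc[['eric', 'dan', 'bob']]   # select by index label
by_position = ser.iloc[[4, 3, 1]]            # select by integer position

# Both selections pick the same rows in the same order.
same = by_label.equals(by_position)
print(same)
```

Using `.loc` and `.iloc` explicitly avoids any ambiguity about whether an integer is meant as a label or as a position.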

In [15]:
ser.iloc[2]                            #integer-location based indexing for selection by position

300

In [16]:
'bob' in ser

True

In [17]:
ser

tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object

In [18]:
ser * 2

tom         200
bob      foofoo
nancy       600
dan      barbar
eric       1000
dtype: object

In [19]:
ser[['nancy', 'eric']] ** 2

nancy     90000
eric     250000
dtype: object

### pandas DataFrame
- *pandas DataFrame* is a 2-dimensional labeled data structure with columns of potentially different types. 
- You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

### Create DataFrame from dictionary of Python Series

In [20]:
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
print(d)

{'one': apple    100.0
ball     200.0
clock    300.0
dtype: float64, 'two': apple      111.0
ball       222.0
cerill     333.0
dancy     4444.0
dtype: float64}


In [21]:
df = pd.DataFrame(d)
df

| | one | two |
|---|---|---|
| apple | 100.0 | 111.0 |
| ball | 200.0 | 222.0 |
| cerill | NaN | 333.0 |
| clock | 300.0 | NaN |
| dancy | NaN | 4444.0 |
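
The missing entries in this output come from index alignment: the row index of the DataFrame is the union of the indexes of the two Series, and a label that is absent from a Series has no value in that column. A small sketch of this, using the hypothetical names `one`, `two`, `df_aligned` and `missing_per_column`:

```python
import pandas as pd

one = pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock'])
two = pd.Series([111., 222., 333., 4444.],
                index=['apple', 'ball', 'cerill', 'dancy'])

# The DataFrame's row index is the union of both Series indexes.
union_index = one.index.union(two.index)
df_aligned = pd.DataFrame({'one': one, 'two': two})

# Labels missing from a Series become NaN in its column.
missing_per_column = df_aligned.isna().sum()
print(list(union_index))
print(missing_per_column)
```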


In [22]:
df.index

Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')

In [23]:
df.columns

Index(['one', 'two'], dtype='object')

In [24]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])

| | one | two |
|---|---|---|
| dancy | NaN | 4444.0 |
| ball | 200.0 | 222.0 |
| apple | 100.0 | 111.0 |


In [25]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])

| | two | five |
|---|---|---|
| dancy | 4444.0 | NaN |
| ball | 222.0 | NaN |
| apple | 111.0 | NaN |


### Create DataFrame from list of Python dictionaries

In [26]:
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
data

[{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]

In [27]:
pd.DataFrame(data)

| | alex | joe | ema | dora | alice |
|---|---|---|---|---|---|
| 0 | 1.0 | 2.0 | NaN | NaN | NaN |
| 1 | NaN | NaN | 5.0 | 10.0 | 20.0 |


In [28]:
pd.DataFrame(data, index=['orange', 'red'])

| | alex | joe | ema | dora | alice |
|---|---|---|---|---|---|
| orange | 1.0 | 2.0 | NaN | NaN | NaN |
| red | NaN | NaN | 5.0 | 10.0 | 20.0 |


In [29]:
pd.DataFrame(data, columns=['joe', 'dora','alice'])

| | joe | dora | alice |
|---|---|---|---|
| 0 | 2.0 | NaN | NaN |
| 1 | NaN | 10.0 | 20.0 |


### Basic DataFrame operations

In [30]:
df

| | one | two |
|---|---|---|
| apple | 100.0 | 111.0 |
| ball | 200.0 | 222.0 |
| cerill | NaN | 333.0 |
| clock | 300.0 | NaN |
| dancy | NaN | 4444.0 |


In [31]:
df['one']

apple     100.0
ball      200.0
cerill      NaN
clock     300.0
dancy       NaN
Name: one, dtype: float64

In [32]:
df['three'] = df['one'] * df['two']
df

| | one | two | three |
|---|---|---|---|
| apple | 100.0 | 111.0 | 11100.0 |
| ball | 200.0 | 222.0 | 44400.0 |
| cerill | NaN | 333.0 | NaN |
| clock | 300.0 | NaN | NaN |
| dancy | NaN | 4444.0 | NaN |


In [33]:
df['flag'] = df['one'] > 250
df

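| | one | two | three | flag |
|---|---|---|---|---|
| apple | 100.0 | 111.0 | 11100.0 | False |
| ball | 200.0 | 222.0 | 44400.0 | False |
| cerill | NaN | 333.0 | NaN | False |
| clock | 300.0 | NaN | NaN | True |
| dancy | NaN | 4444.0 | NaN | False |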


In [34]:
three = df.pop('three')

In [35]:
three

apple     11100.0
ball      44400.0
cerill        NaN
clock         NaN
dancy         NaN
Name: three, dtype: float64

In [36]:
df

| | one | two | flag |
|---|---|---|---|
| apple | 100.0 | 111.0 | False |
| ball | 200.0 | 222.0 | False |
| cerill | NaN | 333.0 | False |
| clock | 300.0 | NaN | True |
| dancy | NaN | 4444.0 | False |


In [37]:
del df['two']

In [38]:
df

| | one | flag |
|---|---|---|
| apple | 100.0 | False |
| ball | 200.0 | False |
| cerill | NaN | False |
| clock | 300.0 | True |
| dancy | NaN | False |


In [39]:
df.insert(2, 'copy_of_one', df['one'])
df

| | one | flag | copy_of_one |
|---|---|---|---|
| apple | 100.0 | False | 100.0 |
| ball | 200.0 | False | 200.0 |
| cerill | NaN | False | NaN |
| clock | 300.0 | True | 300.0 |
| dancy | NaN | False | NaN |


In [40]:
df['one_upper_half'] = df['one'][:2]
df

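| | one | flag | copy_of_one | one_upper_half |
|---|---|---|---|---|
| apple | 100.0 | False | 100.0 | 100.0 |
| ball | 200.0 | False | 200.0 | 200.0 |
| cerill | NaN | False | NaN | NaN |
| clock | 300.0 | True | 300.0 | NaN |
| dancy | NaN | False | NaN | NaN |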


### Example for creating DataFrame from list 

In [42]:
columns = ['name', 'age', 'gender', 'job']

customer1 = pd.DataFrame([['mary', 19, "F", "student"],
                          ['akbar', 26, "M", "student"]],
                         columns=columns)

customer2 = pd.DataFrame([['amar', 22, "M", "student"],
                          ['alice', 58, "F", "manager"]],
                         columns=columns)

customer3 = pd.DataFrame(dict(name=['dinesh', 'julie'],
                              age=[33, 44], gender=['M', 'F'],
                              job=['engineer', 'scientist']))

customer3

| | name | age | gender | job |
|---|---|---|---|---|
| 0 | dinesh | 33 | M | engineer |
| 1 | julie | 44 | F | scientist |


#### Combining DataFrames using concat

In [43]:
# combine DataFrames using concat (DataFrame.append was removed in pandas 2.0)
customers = pd.concat([customer1, customer2, customer3])
customers

| | name | age | gender | job |
|---|---|---|---|---|
| 0 | mary | 19 | F | student |
| 1 | akbar | 26 | M | student |
| 0 | amar | 22 | M | student |
| 1 | alice | 58 | F | manager |
| 0 | dinesh | 33 | M | engineer |
| 1 | julie | 44 | F | scientist |
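
Note that the concatenated frame keeps the original row labels, so 0 and 1 repeat. If a fresh integer index is wanted, `pd.concat` accepts `ignore_index=True`. A minimal sketch with two small frames (`first`, `second`, `stacked` and `renumbered` are illustrative names, not part of the lab):

```python
import pandas as pd

columns = ['name', 'age', 'gender', 'job']
first = pd.DataFrame([['mary', 19, 'F', 'student'],
                      ['akbar', 26, 'M', 'student']], columns=columns)
second = pd.DataFrame([['amar', 22, 'M', 'student'],
                       ['alice', 58, 'F', 'manager']], columns=columns)

# Default behaviour: each frame keeps its own labels, so they repeat.
stacked = pd.concat([first, second])

# ignore_index=True discards the old labels and numbers the rows afresh.
renumbered = pd.concat([first, second], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```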


#### Joining DataFrames using merge

In [44]:
# join DataFrame
customer4 = pd.DataFrame(dict(name=['mary', 'akbar', 'amar', 'julie'],
                          height=[165, 180, 175, 171]))
customer4

| | name | height |
|---|---|---|
| 0 | mary | 165 |
| 1 | akbar | 180 |
| 2 | amar | 175 |
| 3 | julie | 171 |


In [45]:
# Use intersection of keys from both (i.e., customers and customer4) DataFrames
merge_inter = pd.merge(customers, customer4, on='name', how='inner')
merge_inter

| | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | mary | 19 | F | student | 165 |
| 1 | akbar | 26 | M | student | 180 |
| 2 | amar | 22 | M | student | 175 |
| 3 | julie | 44 | F | scientist | 171 |


In [46]:
# Use union of keys from both DataFrames
customers = pd.merge(customers, customer4, on='name', how='outer')   # row order may differ between pandas versions
customers

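| | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | mary | 19 | F | student | 165.0 |
| 1 | akbar | 26 | M | student | 180.0 |
| 2 | amar | 22 | M | student | 175.0 |
| 3 | alice | 58 | F | manager | NaN |
| 4 | dinesh | 33 | M | engineer | NaN |
| 5 | julie | 44 | F | scientist | 171.0 |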
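
The `how` argument of `pd.merge` decides which keys survive the join. The sketch below compares three settings on two small hypothetical frames (`left` and `right` are not part of the lab data); `indicator=True` adds a `_merge` column that records where each row came from.

```python
import pandas as pd

left = pd.DataFrame({'name': ['mary', 'alice', 'julie'], 'age': [19, 58, 44]})
right = pd.DataFrame({'name': ['mary', 'julie', 'amar'],
                      'height': [165, 171, 175]})

# inner: only names present in both frames
inner = pd.merge(left, right, on='name', how='inner')

# outer: names present in either frame; _merge shows the source of each row
outer = pd.merge(left, right, on='name', how='outer', indicator=True)

# left: every name from the left frame, height missing where there is no match
left_join = pd.merge(left, right, on='name', how='left')

print(outer)
```

Keys without a partner get NaN in the columns that come from the other frame, which is why `height` becomes a float column after an outer or left join.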


#### Summarizing

In [62]:
#Summarizing
customers                    # display the DataFrame (long frames are truncated in the output)


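| | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | mary | 19 | F | student | 165.0 |
| 1 | akbar | 26 | M | student | 180.0 |
| 2 | amar | 22 | M | student | 175.0 |
| 3 | alice | 58 | F | manager | NaN |
| 4 | dinesh | 33 | M | engineer | NaN |
| 5 | julie | 44 | F | scientist | 171.0 |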


In [64]:
type(customers)              # DataFrame


pandas.core.frame.DataFrame

In [65]:
customers.head()             # print the first 5 rows


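| | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | mary | 19 | F | student | 165.0 |
| 1 | akbar | 26 | M | student | 180.0 |
| 2 | amar | 22 | M | student | 175.0 |
| 3 | alice | 58 | F | manager | NaN |
| 4 | dinesh | 33 | M | engineer | NaN |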


In [66]:
customers.tail()             # print the last 5 rows


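| | name | age | gender | job | height |
|---|---|---|---|---|---|
| 1 | akbar | 26 | M | student | 180.0 |
| 2 | amar | 22 | M | student | 175.0 |
| 3 | alice | 58 | F | manager | NaN |
| 4 | dinesh | 33 | M | engineer | NaN |
| 5 | julie | 44 | F | scientist | 171.0 |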


In [67]:
customers.index              # "the index" (aka "the labels")


Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

In [68]:
customers.columns            # column names (which is "an index")


Index(['name', 'age', 'gender', 'job', 'height'], dtype='object')

In [69]:
customers.dtypes             # data types of each column


name       object
age         int64
gender     object
job        object
height    float64
dtype: object

In [70]:
customers.shape              # number of rows and columns


(6, 5)

In [72]:
customers.values             # underlying numpy array


array([['mary', 19, 'F', 'student', 165.0],
       ['akbar', 26, 'M', 'student', 180.0],
       ['amar', 22, 'M', 'student', 175.0],
       ['alice', 58, 'F', 'manager', nan],
       ['dinesh', 33, 'M', 'engineer', nan],
       ['julie', 44, 'F', 'scientist', 171.0]], dtype=object)

In [73]:
customers.info()             # concise summary 


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    6 non-null      object 
 1   age     6 non-null      int64  
 2   gender  6 non-null      object 
 3   job     6 non-null      object 
 4   height  4 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 460.0+ bytes
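
The "Non-Null Count" column of `info()` can also be obtained programmatically. A minimal sketch on a small hypothetical frame (`people` is an illustrative name): `isna().sum()` counts the missing values per column and `count()` counts the non-missing ones.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({'name': ['mary', 'alice', 'julie'],
                       'height': [165.0, np.nan, 171.0]})

missing = people.isna().sum()    # number of missing values per column
non_null = people.count()        # number of non-missing values per column

print(missing)
print(non_null)
```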


#### Columns selection

In [74]:
# Columns selection
customers['gender']                 # select one column
print(type(customers['gender']))    # Series
customers.gender                    # select one column using attribute access


<class 'pandas.core.series.Series'>


0    F
1    M
2    M
3    F
4    M
5    F
Name: gender, dtype: object

In [49]:
# select multiple columns
my_cols = ['age', 'gender']         # select two columns, i.e., age and gender, by creating a list...
print(customers[my_cols])           # ...and use that list to select columns
print(type(customers[my_cols]))     # DataFrame


   age gender
0   19      F
1   26      M
2   22      M
3   58      F
4   33      M
5   44      F
<class 'pandas.core.frame.DataFrame'>


#### Rows selection

In [50]:
# iloc is strictly integer position based
df = customers.copy()
df.iloc[0]                            # first row
df.iloc[0, 0]                         # first item of first row
df.iloc[0, 0] = 55

for i in range(df.shape[0]):
    row = df.iloc[i]
    row.age *= 100                    # modifies a copy of the row, not the original frame data

df                             # the loop did not modify df; only df.iloc[0, 0] was changed

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cacher_needs_updating = self._check_is_chained_assignment_possible()


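| | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | 55 | 19 | F | student | 165.0 |
| 1 | akbar | 26 | M | student | 180.0 |
| 2 | amar | 22 | M | student | 175.0 |
| 3 | alice | 58 | F | manager | NaN |
| 4 | dinesh | 33 | M | engineer | NaN |
| 5 | julie | 44 | F | scientist | 171.0 |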
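
The ages are unchanged because `df.iloc[i]` hands back a separate row object, and the multiplication was applied to that object. To change the frame itself by position, the assignment has to go through `df.iloc[row, column]` in one step. A minimal sketch, assuming a small hypothetical frame `people`; `age_pos` is an illustrative helper name:

```python
import pandas as pd

people = pd.DataFrame({'name': ['mary', 'akbar'], 'age': [19, 26]})

# iloc needs integer positions, so look up where the 'age' column sits.
age_pos = people.columns.get_loc('age')

for i in range(people.shape[0]):
    people.iloc[i, age_pos] *= 100    # assigns into the frame, not a copy

print(people)
```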


In [51]:
# loc is label based
df = customers.copy()
df.loc[0]         # first row
df.loc[0, "age"]  # age of the first row
df.loc[0, "age"] = 55

for i in range(df.shape[0]): 
    df.loc[i, "age"] *= 10

df         # df is modified

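| | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | mary | 550 | F | student | 165.0 |
| 1 | akbar | 260 | M | student | 180.0 |
| 2 | amar | 220 | M | student | 175.0 |
| 3 | alice | 580 | F | manager | NaN |
| 4 | dinesh | 330 | M | engineer | NaN |
| 5 | julie | 440 | F | scientist | 171.0 |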
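
A row-by-row loop with `.loc` works, but the same update can usually be written as a single operation on the whole column. The sketch below compares the two on a small hypothetical frame (`people`, `looped` and `vectorized` are illustrative names):

```python
import pandas as pd

people = pd.DataFrame({'name': ['mary', 'akbar', 'amar'],
                       'age': [19, 26, 22]})

# Row by row: one label-based assignment per row.
looped = people.copy()
for i in range(looped.shape[0]):
    looped.loc[i, 'age'] *= 10

# Vectorized: multiply the whole column at once.
vectorized = people.copy()
vectorized['age'] = vectorized['age'] * 10

print(vectorized)
```

Both versions give the same result; the column-wise form is shorter and tends to be much faster on large frames.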


#### Row selection with simple logical filtering

In [58]:
customers[customers.age < 20]            # only show users with age < 20


| | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | mary | 19 | F | student | 165.0 |


#### Row selection with advanced logical filtering

In [55]:
customers.loc[customers.age < 20, ['age', 'job']]             # select multiple columns, i.e., age and job, with age < 20


| | age | job |
|---|---|---|
| 0 | 19 | student |


In [56]:
customers[(customers.age > 20) & (customers.gender == 'M')]   # use multiple conditions


| | name | age | gender | job | height |
|---|---|---|---|---|---|
| 1 | akbar | 26 | M | student | 180.0 |
| 2 | amar | 22 | M | student | 175.0 |
| 4 | dinesh | 33 | M | engineer | NaN |


In [57]:
customers[customers.job.isin(['student', 'engineer'])]    # filter specific values, i.e., job as student or engineer

| | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | mary | 19 | F | student | 165.0 |
| 1 | akbar | 26 | M | student | 180.0 |
| 2 | amar | 22 | M | student | 175.0 |
| 4 | dinesh | 33 | M | engineer | NaN |
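
Boolean filters are ordinary Series of True/False values, so they can be stored, combined with `&` and `|`, and negated with `~`. A minimal sketch on a hypothetical frame `people`, showing that `isin` matches an explicit "or" of two comparisons and how to invert it:

```python
import pandas as pd

people = pd.DataFrame({'name': ['mary', 'alice', 'dinesh', 'julie'],
                       'age': [19, 58, 33, 44],
                       'job': ['student', 'manager', 'engineer', 'scientist']})

# Two equivalent ways to build the same mask.
in_jobs = people['job'].isin(['student', 'engineer'])
by_or = (people['job'] == 'student') | (people['job'] == 'engineer')

# ~ negates a mask: everyone who is neither a student nor an engineer.
not_in_jobs = people[~in_jobs]

print(not_in_jobs)
```

Each comparison needs its own parentheses because `&` and `|` bind more tightly than `==` and `<` in Python.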


In [77]:
df

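| | name | age | gender | job | height |
|---|---|---|---|---|---|
| 0 | mary | 550 | F | student | 165.0 |
| 1 | akbar | 260 | M | student | 180.0 |
| 2 | amar | 220 | M | student | 175.0 |
| 3 | alice | 580 | F | manager | NaN |
| 4 | dinesh | 330 | M | engineer | NaN |
| 5 | julie | 440 | F | scientist | 171.0 |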


#### Descriptive statistics

In [75]:
df.describe()                #Summarize all numeric columns

| | age | height |
|---|---|---|
| count | 6.0 | 4.0 |
| mean | 396.666667 | 172.75 |
| std | 150.554531 | 6.344289 |
| min | 220.0 | 165.0 |
| 25% | 277.5 | 169.5 |
| 50% | 385.0 | 173.0 |
| 75% | 522.5 | 176.25 |
| max | 580.0 | 180.0 |


In [76]:
df.describe(include='all')   #Summarize all columns

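| | name | age | gender | job | height |
|---|---|---|---|---|---|
| count | 6 | 6.0 | 6 | 6 | 4.0 |
| unique | 6 | NaN | 2 | 4 | NaN |
| top | mary | NaN | F | student | NaN |
| freq | 1 | NaN | 3 | 3 | NaN |
| mean | NaN | 396.666667 | NaN | NaN | 172.75 |
| std | NaN | 150.554531 | NaN | NaN | 6.344289 |
| min | NaN | 220.0 | NaN | NaN | 165.0 |
| 25% | NaN | 277.5 | NaN | NaN | 169.5 |
| 50% | NaN | 385.0 | NaN | NaN | 173.0 |
| 75% | NaN | 522.5 | NaN | NaN | 176.25 |
| max | NaN | 580.0 | NaN | NaN | 180.0 |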


In [78]:
df.describe(include=['object'])         #limit to objects

| | name | gender | job |
|---|---|---|---|
| count | 6 | 6 | 6 |
| unique | 6 | 2 | 4 |
| top | mary | F | student |
| freq | 1 | 3 | 3 |


#### Statistics per group (groupby)

In [79]:
print(df.groupby("job").mean(numeric_only=True))   # numeric_only=True skips the text columns (needed in pandas 2.x)

                  age      height
job                              
engineer   330.000000         NaN
manager    580.000000         NaN
scientist  440.000000  171.000000
student    343.333333  173.333333


In [80]:
df.groupby("job")["age"].mean()   # groupby job with age statistics

job
engineer     330.000000
manager      580.000000
scientist    440.000000
student      343.333333
Name: age, dtype: float64
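
More than one statistic per group can be requested at once with `agg`. The following is a minimal sketch on a small hypothetical frame `people` (not the lab's `df`); the list of statistic names is passed to `agg`, and each one becomes a column of the result.

```python
import pandas as pd

people = pd.DataFrame({'job': ['student', 'student', 'student', 'manager'],
                       'age': [19, 26, 22, 58]})

# One row per job, one column per requested statistic.
age_stats = people.groupby('job')['age'].agg(['count', 'mean', 'min', 'max'])

print(age_stats)
```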


## Reference on Pandas for more details
* *pandas* Documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html
* *Python for Data Analysis* by Wes McKinney
* *Python Data Science Handbook* by Jake VanderPlas

