# Pandas Basics

## Today's Outline:
- Why Pandas?
- **Pandas Functions**
- [Practical Exercises](https://www.w3resource.com/python-exercises/pandas/index.php)
- Case-study

==========

## Why Pandas
- Python Data Analysis Library (Pandas)
- Dealing with Tabular Data
- Data Preprocessing & Manipulation
- Data Analysis & Visualization

#### Pandas Documentation
https://pandas.pydata.org/docs/index.html

Download the cheat sheet here:
- https://drive.google.com/file/d/1UHK8wtWbADvHKXFC937IS6MTnlSZC_zB/view
- http://datacamp-community-prod.s3.amazonaws.com/dbed353d-2757-4617-8206-8767ab379ab3

==========

## Pandas Functions
- Importing Pandas
- Creating & Initializing Pandas Data Structures
    - **Series** (1D Vector)
    - **DataFrame** (2D Matrix)
- Data I/O
- Inspecting Properties
- Indexing, Slicing, & Subsetting
    - Label-based vs Position-based Indexing
- Data Cleaning (Missing Data)
- Data Manipulation
    - Sorting, Grouping & Pivot Tables
- Statistics

### Importing Pandas

In [1]:
import pandas

In [2]:
# Always try to use this convention
import pandas as pd

In [3]:
# You can import a specific function from Pandas
from pandas import read_csv 

In [4]:
# You can also import a specific function from any sub-module
from pandas.plotting import boxplot

In [5]:
# NumPy is almost always used alongside Pandas
import numpy as np

==========

### Pandas Data Structures

Pandas provides us with two main data structures:
- **Series** (1D - a NumPy-like array with index labels, one-dimensional data, **s** is the convention for a Series object)
- **DataFrame** (2D - Collection of Series, tabular data, **df** is the convention for a DataFrame object)
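
To make the relationship between the two structures concrete, here is a minimal sketch. The data and the names `s_demo`, `df_demo`, and `column` are made up for illustration; it only assumes that selecting one column of a DataFrame gives back a Series.

```python
import pandas as pd

# Hypothetical data, used only for illustration
s_demo = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='score')

# A DataFrame can be built from Series that share the same index
df_demo = pd.DataFrame({'score': s_demo, 'double': s_demo * 2})

# Selecting a single column returns a Series again
column = df_demo['score']
print(type(column).__name__)
print(df_demo.shape, s_demo.shape)
```

The shapes show the difference in dimensionality: the DataFrame has rows and columns, while the Series has a single axis.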

### Creating & Initializing Pandas Series

#### Using a List

In [6]:
# Creating a Series object using a List
s = pd.Series([1,2,3,4])
s

0    1
1    2
2    3
3    4
dtype: int64

In [7]:
# You can also specify the index labels
s = pd.Series(data = [1,2,3,4], index = ['A','B','C','D'])
s
# Also, you can type this: s = pd.Series([1,2,3,4], ['A','B','C','D'])

A    1
B    2
C    3
D    4
dtype: int64

#### Using a NumPy Array

In [8]:
# Creating a Series object using a NumPy Array and index labels of mixed types
s = pd.Series(np.random.randint(4,size=4), [0,'b',True,None], dtype=float)
s

0       2.0
b       3.0
True    1.0
NaN     2.0
dtype: float64

#### Using a Dictionary

In [9]:
# Creating a Series object using a dictionary
s = pd.Series({'Ahmed':10,'Sara':20,'Mustafa':30})
s

Ahmed      10
Sara       20
Mustafa    30
dtype: int64

#### Using a Scalar

In [10]:
# Creating a Series object using a specific value
s = pd.Series(150, index = np.arange(5))
s

0    150
1    150
2    150
3    150
4    150
dtype: int64

### Creating & Initializing Pandas DataFrame

#### Using a NumPy Array

In [11]:
# Creating a DataFrame object using a NumPy Array and specifying the row and column names
df = pd.DataFrame(np.random.rand(2,3),
                  index=['row1', 'row2'],
                  columns=['col1','col2','col3'])
df

Unnamed: 0,col1,col2,col3
row1,0.345322,0.930511,0.844175
row2,0.084815,0.308957,0.237295


In [12]:
array_a = np.array([[3, 2, 1], [6, 3, 2]])
df = pd.DataFrame(array_a)
df

Unnamed: 0,0,1,2
0,3,2,1
1,6,3,2


#### Using a Dictionary

In [13]:
# Creating a DataFrame object using a Dictionary
data = {'Country': ['Egypt', 'Saudi Arabia', 'Qatar'],
        'Capital': ['Cairo', 'Riyadh', 'Doha'],
        'Population': [21190846, 7846277, 2847528]}
df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])
df

Unnamed: 0,Country,Capital,Population
0,Egypt,Cairo,21190846
1,Saudi Arabia,Riyadh,7846277
2,Qatar,Doha,2847528


#### Using a Series Object

In [14]:
# Creating a DataFrame object from Series objects (passed as dictionary values)

student_series = pd.Series(['Ahmed', 'Sara', 'Mustafa', 'Fatma']) 
grade_series = pd.Series([210, 211, 114, 178]) 

df = pd.DataFrame({'Student': student_series, 'Grade': grade_series})

df

Unnamed: 0,Student,Grade
0,Ahmed,210
1,Sara,211
2,Mustafa,114
3,Fatma,178


==========

### Pandas Data I/O

| Method / Attribute                       	| Description                                                                    	|
|------------------------------------------	|--------------------------------------------------------------------------------	|
| pd.read_csv(filename)                    	| From a CSV file                                                                	|
| pd.read_table(filename)                  	| From a delimited text file (like TSV)                                          	|
| pd.read_excel(filename)                  	| From an Excel file                                                             	|
| pd.read_sql(query, connection_object)    	| Read from a SQL table/database                                                 	|
| pd.read_json(json_string)                	| Read from a JSON formatted string, URL or file.                                	|
| pd.read_html(url)                        	| Parses an html URL, string or file and extracts tables to a list of dataframes 	|
| pd.read_clipboard()                      	| Takes the contents of your clipboard and passes it to read_table()             	|
| pd.DataFrame(dict)                       	| From a dict, keys for columns names, values for data as lists                  	|
| df.to_csv(filename)                      	| Write to a CSV file                                                            	|
| df.to_excel(filename)                    	| Write to an Excel file                                                         	|
| df.to_sql(table_name, connection_object) 	| Write to a SQL table                                                           	|
| df.to_json(filename)                     	| Write to a file in JSON format                                                 	|
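
The readers and writers in the table also accept file-like objects, which makes it easy to try a round trip without touching the disk. The sketch below is only an illustration: `df_io` and the buffer names are hypothetical, and it assumes an in-memory `io.StringIO` buffer in place of a real file.

```python
import io
import pandas as pd

# Hypothetical DataFrame, used only for illustration
df_io = pd.DataFrame({'a': [0, 4], 'b': [1, 5]})

# index=False keeps the row labels out of the written text
buffer = io.StringIO()
df_io.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# With the default index=True, the row labels are written as an
# extra column with an empty header
buffer_with_index = io.StringIO()
df_io.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)

print(list(round_trip.columns))
print(list(with_index.columns))
```

When a file was saved with its index, reading it back produces an extra `Unnamed: 0` column; saving with `index=False` avoids that.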

#### CSV Files

In [15]:
# Importing a csv file
df_csv = pd.read_csv('data/pandas-io.csv')
df_csv

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


In [16]:
# Exporting a DataFrame to a .csv file
df_csv.to_csv('data/df-to-csv.csv',index=False)

# Reading the .csv file again
df_csv = pd.read_csv('data/df-to-csv.csv')
df_csv

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


#### Excel Files

In [17]:
# Importing an excel file
df_xlsx = pd.read_excel('data/pandas-io.xlsx',sheet_name='Sheet1')
df_xlsx

Unnamed: 0.1,Unnamed: 0,a,b,c,d
0,0,0,1,2,3
1,1,4,5,6,7
2,2,8,9,10,11
3,3,12,13,14,15


In [18]:
# Exporting a DataFrame to an excel file
df_xlsx.to_excel('data/df-to-excel.xlsx', index=False)

# Reading the .xlsx file again
df_xlsx = pd.read_excel('data/df-to-excel.xlsx')
df_xlsx

Unnamed: 0.1,Unnamed: 0,a,b,c,d
0,0,0,1,2,3
1,1,4,5,6,7
2,2,8,9,10,11
3,3,12,13,14,15


#### HTML Files

Let's get the full table of Aamir Khan's filmography from Wikipedia: 
https://en.wikipedia.org/wiki/Aamir_Khan_filmography

In [19]:
# Importing a table from an HTML file
df_html = pd.read_html('https://en.wikipedia.org/wiki/Aamir_Khan_filmography')
df_html[1]

Unnamed: 0_level_0,Title,Year,Credited as,Credited as,Credited as,Credited as,Notes,Ref.
Unnamed: 0_level_1,Title,Year,Actor,Producer,Other,Role,Notes,Ref.
0,Yaadon Ki Baaraat,1973,Yes,,,Young Ratan[II],Minor role,[32]
1,Madhosh,1974,Yes,,,Young Raj[III],Minor role,[32]
2,Paranoia,1983,Yes,,Assistant director,Unknown,Short film,[33][34]
3,Manzil Manzil,1984,,,Assistant director,,,[4]
4,Holi,1984,Yes,,,Madan Sharma,,[33]
...,...,...,...,...,...,...,...,...
56,Dangal,2016,Yes,Yes,Playback singer,Mahavir Singh Phogat,Filmfare Award for Best Actor Filmfare Award f...,[90][91]
57,Secret Superstar,2017,Yes,Yes,,Shakti Kumar,Nominated—Filmfare Award for Best Supporting A...,[92] [93]
58,Thugs of Hindostan,2018,Yes,,,Firangi Mallah,,[94]
59,Laal Singh Chaddha,2021,Yes,Yes,,Laal Singh Chaddha,,[95][96]


In [None]:
df_html[1].to_csv('data/df-to-html.csv',index=False)

#### SQL Database Files

In [20]:
# Importing the sql engine
from sqlalchemy import create_engine

# Let's create a temporary database in the memory
engine = create_engine('sqlite:///:memory:')

# Convert your DataFrame into an sql database
df_csv.to_sql('data', engine)

# Importing the sql database file
sql_df = pd.read_sql('SELECT * FROM data',con=engine)

sql_df

Unnamed: 0,index,a,b,c,d
0,0,0,1,2,3
1,1,4,5,6,7
2,2,8,9,10,11
3,3,12,13,14,15
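
The extra `index` column in the result comes from `to_sql()` writing the DataFrame index by default. Here is a minimal sketch of how to avoid it, assuming the standard-library `sqlite3` module instead of SQLAlchemy; the names `df_sql`, `conn`, and `back` are hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical DataFrame, used only for illustration
df_sql = pd.DataFrame({'a': [0, 4], 'b': [1, 5]})

# A temporary in-memory SQLite database
conn = sqlite3.connect(':memory:')

# index=False stops the DataFrame index from being stored as a column
df_sql.to_sql('data', conn, index=False)

back = pd.read_sql('SELECT * FROM data', conn)
conn.close()

print(list(back.columns))
```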


==========

### Inspecting Properties Using the Common Attributes of Pandas

| Method / Attribute               	| Description                              	|
|----------------------------------	|------------------------------------------	|
| df.head(n)                       	| First n rows of the DataFrame            	|
| df.tail(n)                       	| Last n rows of the DataFrame             	|
| df.shape                         	| Number of rows and columns               	|
| df.info()                        	| Index, Datatype and Memory information   	|
| df.describe()                    	| Summary statistics for numerical columns 	|
| s.value_counts(dropna=False)     	| View unique values and counts            	|
| df.apply(pd.Series.value_counts) 	| Unique values and counts for all columns 	|
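
The `dropna=False` argument in the table is easy to overlook. The sketch below uses a made-up Series (`s_counts`) to show that `value_counts()` skips missing values unless you ask for them.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
s_counts = pd.Series(['x', 'y', 'x', np.nan])

# By default, missing values are not counted
without_nan = s_counts.value_counts()

# dropna=False adds a row for the missing values
with_nan = s_counts.value_counts(dropna=False)

print(without_nan.sum(), with_nan.sum())
```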

#### DataFrame Attributes

In [21]:
# Let's define a DataFrame to inspect its properties
np.random.seed(101)
df = pd.DataFrame(np.random.rand(6,4),
                  index=['A','B','C','D','E','F'],
                  columns=['W','X','Y','Z'])
df

Unnamed: 0,W,X,Y,Z
A,0.516399,0.570668,0.028474,0.171522
B,0.685277,0.833897,0.306966,0.893613
C,0.721544,0.189939,0.554228,0.352132
D,0.181892,0.785602,0.965483,0.232354
E,0.083561,0.603548,0.728993,0.276239
F,0.685306,0.517867,0.048485,0.137869


In [22]:
# Full information about our DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, A to F
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   W       6 non-null      float64
 1   X       6 non-null      float64
 2   Y       6 non-null      float64
 3   Z       6 non-null      float64
dtypes: float64(4)
memory usage: 240.0+ bytes


In [23]:
# Getting the summary statistics for our dataset
df.describe()

Unnamed: 0,W,X,Y,Z
count,6.0,6.0,6.0,6.0
mean,0.478997,0.583587,0.438771,0.343955
std,0.279279,0.229481,0.377631,0.279798
min,0.083561,0.189939,0.028474,0.137869
25%,0.265519,0.531068,0.113105,0.18673
50%,0.600838,0.587108,0.430597,0.254296
75%,0.685299,0.740088,0.685301,0.333159
max,0.721544,0.833897,0.965483,0.893613


In [24]:
# Getting the first 5 rows in our dataset
df.head()
# df[:5]

Unnamed: 0,W,X,Y,Z
A,0.516399,0.570668,0.028474,0.171522
B,0.685277,0.833897,0.306966,0.893613
C,0.721544,0.189939,0.554228,0.352132
D,0.181892,0.785602,0.965483,0.232354
E,0.083561,0.603548,0.728993,0.276239


In [25]:
# You can also get the last rows of the dataset and specify how many rows to print
df.tail(3)

Unnamed: 0,W,X,Y,Z
D,0.181892,0.785602,0.965483,0.232354
E,0.083561,0.603548,0.728993,0.276239
F,0.685306,0.517867,0.048485,0.137869


In [26]:
# Outputting the columns names
df.columns

Index(['W', 'X', 'Y', 'Z'], dtype='object')

In [27]:
# Outputting the index labels
df.index

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

In [28]:
# The shape of our DataFrame
df.shape

(6, 4)

In [29]:
# Finding the data type of each column
df.dtypes

W    float64
X    float64
Y    float64
Z    float64
dtype: object

In [30]:
# Counting the non-null values in each column
df.count()

W    6
X    6
Y    6
Z    6
dtype: int64

In [31]:
# Converting a DataFrame object to a 2D NumPy Array
df.values

array([[0.51639863, 0.57066759, 0.02847423, 0.17152166],
       [0.68527698, 0.83389686, 0.30696622, 0.89361308],
       [0.72154386, 0.18993895, 0.55422759, 0.35213195],
       [0.1818924 , 0.78560176, 0.96548322, 0.23235366],
       [0.08356143, 0.60354842, 0.72899276, 0.27623883],
       [0.68530633, 0.51786747, 0.04848454, 0.13786924]])

In [32]:
# Let's check out its type to validate our result
type(df.values)

numpy.ndarray

In [33]:
# Another way to convert a DataFrame object to a 2D NumPy Array
df.to_numpy()

array([[0.51639863, 0.57066759, 0.02847423, 0.17152166],
       [0.68527698, 0.83389686, 0.30696622, 0.89361308],
       [0.72154386, 0.18993895, 0.55422759, 0.35213195],
       [0.1818924 , 0.78560176, 0.96548322, 0.23235366],
       [0.08356143, 0.60354842, 0.72899276, 0.27623883],
       [0.68530633, 0.51786747, 0.04848454, 0.13786924]])

In [34]:
# Now our DataFrame becomes a 2D NumPy Array
type(df.to_numpy())

numpy.ndarray

#### Series Attributes

In [35]:
# Some of these attributes work with Series objects too
s = pd.Series([3, 7, 7, 4], index=['a', 'b', 'c', 'd'])
s

a    3
b    7
c    7
d    4
dtype: int64

In [36]:
# Getting the summary statistics
s.describe()

count    4.000000
mean     5.250000
std      2.061553
min      3.000000
25%      3.750000
50%      5.500000
75%      7.000000
max      7.000000
dtype: float64

In [37]:
# Find the shape of the Series object
s.shape

(4,)

In [38]:
# Listing the index labels
s.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [39]:
# Converting a Series object to a 1D NumPy Array
s.values

array([3, 7, 7, 4], dtype=int64)

In [40]:
# We can find the count of elements in a Series
s.value_counts()

7    2
3    1
4    1
dtype: int64

In [41]:
# Find the unique values in the Series object
s.unique()

array([3, 7, 4], dtype=int64)

In [42]:
# Find the number of the unique values in the Series object
s.nunique()

3

In [43]:
# Applying a function to each element of the Series
s.apply(lambda x: x*2)

a     6
b    14
c    14
d     8
dtype: int64
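
For simple arithmetic like this, `apply()` is not strictly needed. As a small sketch (the names `s_apply`, `doubled_apply`, and `doubled_vector` are illustrative), the same result can be obtained with a vectorized operation on the whole Series:

```python
import pandas as pd

s_apply = pd.Series([3, 7, 7, 4], index=['a', 'b', 'c', 'd'])

# apply() calls the function once per element
doubled_apply = s_apply.apply(lambda x: x * 2)

# Arithmetic operators work on the whole Series at once
doubled_vector = s_apply * 2

same_result = doubled_apply.equals(doubled_vector)
print(same_result)
```

`apply()` remains useful when the logic cannot be written with built-in operators or methods.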

==========

### Indexing, Slicing, & Subsetting

| Method / Attribute 	| Description                             	|
|--------------------	|-----------------------------------------	|
| df[col]            	| Returns column with label col as Series 	|
| df[[col1, col2]]   	| Returns columns as a new DataFrame      	|
| s.iloc[0]          	| Selection by position                   	|
| s.loc['index_one'] 	| Selection by index label                	|
| df.iloc[0,:]       	| First row                               	|
| df.iloc[0,0]       	| First element of first column           	|
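
The difference between `.loc[]` and `.iloc[]` is clearest when the index labels are themselves integers. A minimal sketch with made-up data (`s_idx` and `df_idx` are hypothetical names):

```python
import pandas as pd

# Integer labels that do not match the positions
s_idx = pd.Series([100, 200, 300], index=[2, 1, 0])

by_position = s_idx.iloc[0]   # first element by position
by_label = s_idx.loc[0]       # element whose label is 0

df_idx = pd.DataFrame({'v': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

# Position slices exclude the end, label slices include it
pos_slice = df_idx.iloc[0:2]
label_slice = df_idx.loc['a':'c']

print(by_position, by_label)
print(list(pos_slice.index), list(label_slice.index))
```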

#### Series & DataFrame Indexing

In [44]:
df

Unnamed: 0,W,X,Y,Z
A,0.516399,0.570668,0.028474,0.171522
B,0.685277,0.833897,0.306966,0.893613
C,0.721544,0.189939,0.554228,0.352132
D,0.181892,0.785602,0.965483,0.232354
E,0.083561,0.603548,0.728993,0.276239
F,0.685306,0.517867,0.048485,0.137869


In [45]:
# Indexing a specific column
df['W']

A    0.516399
B    0.685277
C    0.721544
D    0.181892
E    0.083561
F    0.685306
Name: W, dtype: float64

In [46]:
# Indexing of a single column produces a Series object
type(df['W'])

pandas.core.series.Series

In [47]:
# Attribute (dot) access also selects a single column, but it is not recommended: it fails for names with spaces or names that clash with DataFrame methods
df.W

A    0.516399
B    0.685277
C    0.721544
D    0.181892
E    0.083561
F    0.685306
Name: W, dtype: float64

In [48]:
# Indexing multiple columns
df[['W','Z']]

Unnamed: 0,W,Z
A,0.516399,0.171522
B,0.685277,0.893613
C,0.721544,0.352132
D,0.181892,0.232354
E,0.083561,0.276239
F,0.685306,0.137869


In [49]:
# As expected, this gives us a DataFrame object
type(df[['W','Z']])

pandas.core.frame.DataFrame

In [50]:
# You can create a new column just by naming it and assigning it values
df['S'] = df['W'] + df['Y']
df

Unnamed: 0,W,X,Y,Z,S
A,0.516399,0.570668,0.028474,0.171522,0.544873
B,0.685277,0.833897,0.306966,0.893613,0.992243
C,0.721544,0.189939,0.554228,0.352132,1.275771
D,0.181892,0.785602,0.965483,0.232354,1.147376
E,0.083561,0.603548,0.728993,0.276239,0.812554
F,0.685306,0.517867,0.048485,0.137869,0.733791


In [51]:
# You can drop the new column using drop() function
df.drop('S', axis=1, inplace=True)
df

Unnamed: 0,W,X,Y,Z
A,0.516399,0.570668,0.028474,0.171522
B,0.685277,0.833897,0.306966,0.893613
C,0.721544,0.189939,0.554228,0.352132
D,0.181892,0.785602,0.965483,0.232354
E,0.083561,0.603548,0.728993,0.276239
F,0.685306,0.517867,0.048485,0.137869


In [52]:
# You can also drop rows this way
df.drop('D', axis=0)

Unnamed: 0,W,X,Y,Z
A,0.516399,0.570668,0.028474,0.171522
B,0.685277,0.833897,0.306966,0.893613
C,0.721544,0.189939,0.554228,0.352132
E,0.083561,0.603548,0.728993,0.276239
F,0.685306,0.517867,0.048485,0.137869


In [53]:
# How about indexing a Series object
s

a    3
b    7
c    7
d    4
dtype: int64

In [54]:
# Indexing an element in a Series object can be done as follows
s['c']

7

#### Position-based Indexing with .iloc[] (NumPy Indexing Style)

In [55]:
df

Unnamed: 0,W,X,Y,Z
A,0.516399,0.570668,0.028474,0.171522
B,0.685277,0.833897,0.306966,0.893613
C,0.721544,0.189939,0.554228,0.352132
D,0.181892,0.785602,0.965483,0.232354
E,0.083561,0.603548,0.728993,0.276239
F,0.685306,0.517867,0.048485,0.137869


In [56]:
# This returns the 3rd row
df.iloc[2]

W    0.721544
X    0.189939
Y    0.554228
Z    0.352132
Name: C, dtype: float64

In [57]:
# To get a specific element using its position (4th row, 2nd column)
df.iloc[3,1] # df.iloc[[3],[1]] selects the same cell but returns a 1x1 DataFrame

0.7856017618643588

In [58]:
# Getting all the records in the 4th column
df.iloc[:, 3]

A    0.171522
B    0.893613
C    0.352132
D    0.232354
E    0.276239
F    0.137869
Name: Z, dtype: float64

In [59]:
# Slicing the records
df.iloc[1:4, 3]

B    0.893613
C    0.352132
D    0.232354
Name: Z, dtype: float64

In [60]:
# Indexing using slicing and a list mask (fancy indexing)
df.iloc[2:6:2, [3, 1]]

Unnamed: 0,Z,X
C,0.352132,0.189939
E,0.276239,0.603548


#### Label-based Indexing with .loc[]

In [61]:
df

Unnamed: 0,W,X,Y,Z
A,0.516399,0.570668,0.028474,0.171522
B,0.685277,0.833897,0.306966,0.893613
C,0.721544,0.189939,0.554228,0.352132
D,0.181892,0.785602,0.965483,0.232354
E,0.083561,0.603548,0.728993,0.276239
F,0.685306,0.517867,0.048485,0.137869


In [62]:
# This returns the 1st row
df.loc['A']

W    0.516399
X    0.570668
Y    0.028474
Z    0.171522
Name: A, dtype: float64

In [63]:
# To get a specific element, you will index its row and its column
df.loc['B','Y']

0.3069662196722378

In [64]:
# Indexing multiple rows and columns
df.loc[['A','C'],['W','Y']]

Unnamed: 0,W,Y
A,0.516399,0.028474
C,0.721544,0.554228


In [65]:
df.loc['A'].loc['W']

0.5163986277024462

#### Boolean Indexing

In [66]:
df > 0.5

Unnamed: 0,W,X,Y,Z
A,True,True,False,False
B,True,True,False,True
C,True,False,True,False
D,False,True,True,False
E,False,True,True,False
F,True,True,False,False


In [67]:
df[df > 0.5]

Unnamed: 0,W,X,Y,Z
A,0.516399,0.570668,,
B,0.685277,0.833897,,0.893613
C,0.721544,,0.554228,
D,,0.785602,0.965483,
E,,0.603548,0.728993,
F,0.685306,0.517867,,


In [68]:
df[df['W'] > 0.5]

Unnamed: 0,W,X,Y,Z
A,0.516399,0.570668,0.028474,0.171522
B,0.685277,0.833897,0.306966,0.893613
C,0.721544,0.189939,0.554228,0.352132
F,0.685306,0.517867,0.048485,0.137869


In [69]:
df[df['W'] > 0.5]['Y']

A    0.028474
B    0.306966
C    0.554228
F    0.048485
Name: Y, dtype: float64
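
Chaining two selections like this works for reading, but the same result can be written as a single `.loc[]` call, which is also the form to use when assigning values. A minimal sketch on a made-up DataFrame (`df_mask` and the other names are hypothetical):

```python
import pandas as pd

df_mask = pd.DataFrame({'W': [0.6, 0.2, 0.9], 'Y': [1.0, 2.0, 3.0]},
                       index=['A', 'B', 'C'])

# Two chained selections: filter the rows, then pick a column
chained = df_mask[df_mask['W'] > 0.5]['Y']

# One selection: row mask and column label together
single = df_mask.loc[df_mask['W'] > 0.5, 'Y']
same_selection = chained.equals(single)

# Assigning through .loc[] updates the original DataFrame
df_mask.loc[df_mask['W'] > 0.5, 'Y'] = 0.0

print(same_selection)
print(df_mask['Y'].tolist())
```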

In [70]:
df[(df['W'] > 0.5) & (df['Y'] < 0.7)]

Unnamed: 0,W,X,Y,Z
A,0.516399,0.570668,0.028474,0.171522
B,0.685277,0.833897,0.306966,0.893613
C,0.721544,0.189939,0.554228,0.352132
F,0.685306,0.517867,0.048485,0.137869


In [71]:
# Reset to default 0,1...n index
df.reset_index()

Unnamed: 0,index,W,X,Y,Z
0,A,0.516399,0.570668,0.028474,0.171522
1,B,0.685277,0.833897,0.306966,0.893613
2,C,0.721544,0.189939,0.554228,0.352132
3,D,0.181892,0.785602,0.965483,0.232354
4,E,0.083561,0.603548,0.728993,0.276239
5,F,0.685306,0.517867,0.048485,0.137869


In [72]:
# Let's add a new column
df['Names'] = ['Ahmed', 'Mustafa', 'Ali', 'Sara', 'Marwa', 'Mai']
df

Unnamed: 0,W,X,Y,Z,Names
A,0.516399,0.570668,0.028474,0.171522,Ahmed
B,0.685277,0.833897,0.306966,0.893613,Mustafa
C,0.721544,0.189939,0.554228,0.352132,Ali
D,0.181892,0.785602,0.965483,0.232354,Sara
E,0.083561,0.603548,0.728993,0.276239,Marwa
F,0.685306,0.517867,0.048485,0.137869,Mai


In [73]:
# Setting the index to a specific column
df.set_index('Names')

Unnamed: 0_level_0,W,X,Y,Z
Names,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Ahmed,0.516399,0.570668,0.028474,0.171522
Mustafa,0.685277,0.833897,0.306966,0.893613
Ali,0.721544,0.189939,0.554228,0.352132
Sara,0.181892,0.785602,0.965483,0.232354
Marwa,0.083561,0.603548,0.728993,0.276239
Mai,0.685306,0.517867,0.048485,0.137869


In [74]:
df

Unnamed: 0,W,X,Y,Z,Names
A,0.516399,0.570668,0.028474,0.171522,Ahmed
B,0.685277,0.833897,0.306966,0.893613,Mustafa
C,0.721544,0.189939,0.554228,0.352132,Ali
D,0.181892,0.785602,0.965483,0.232354,Sara
E,0.083561,0.603548,0.728993,0.276239,Marwa
F,0.685306,0.517867,0.048485,0.137869,Mai


In [76]:
df.drop('Names', axis=1, inplace=True)

In [77]:
df

Unnamed: 0,W,X,Y,Z
A,0.516399,0.570668,0.028474,0.171522
B,0.685277,0.833897,0.306966,0.893613
C,0.721544,0.189939,0.554228,0.352132
D,0.181892,0.785602,0.965483,0.232354
E,0.083561,0.603548,0.728993,0.276239
F,0.685306,0.517867,0.048485,0.137869


==========

### Data Cleaning

| Method / Attribute                           	| Description                                         	|
|----------------------------------------------	|-----------------------------------------------------	|
| df.columns = ['a','b','c']                   	| Rename columns                                      	|
| pd.isnull()                                  	| Checks for null values, returns a Boolean array     	|
| pd.notnull()                                 	| Opposite of pd.isnull()                             	|
| df.dropna()                                  	| Drop all rows that contain null values              	|
| df.dropna(axis=1)                            	| Drop all columns that contain null values           	|
| df.dropna(axis=1,thresh=n)                   	| Drop all columns that have fewer than n non-null values 	|
| df.fillna(x)                                 	| Replace all null values with x                      	|
| s.fillna(s.mean())                           	| Replace all null values with the mean               	|
| s.astype(float)                              	| Convert the datatype of the series to float         	|
| s.replace(1,'one')                           	| Replace all values equal to 1 with 'one'            	|
| s.replace([2,3],['two', 'three'])            	| Replace all 2 with 'two' and 3 with 'three'         	|
| df.rename(columns={'old_name': 'new_name'})  	| Selective renaming                                  	|
| df.set_index('column_one')                   	| Change the index                                    	|
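
The `thresh` argument is the one that most often causes confusion: it is the minimum number of non-null values a row (or a column, with `axis=1`) needs in order to be kept. A minimal sketch on made-up data (`df_na` and the result names are hypothetical):

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({'A': [1, 2, np.nan],
                      'B': [5, np.nan, np.nan],
                      'C': [1, 2, 3]})

# Keep rows that have at least 2 non-null values
rows_kept = df_na.dropna(thresh=2)

# Keep columns that have at least 2 non-null values
cols_kept = df_na.dropna(axis=1, thresh=2)

# Fill every column's missing values with that column's mean
filled = df_na.fillna(df_na.mean())

print(list(rows_kept.index))
print(list(cols_kept.columns))
print(filled)
```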

In [78]:
df_clean = pd.DataFrame({'A':[1,2,np.nan],
                    'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})
df_clean

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
2,,,3


In [79]:
# Checking for all the missing values
df_clean.isnull()

Unnamed: 0,A,B,C
0,False,False,False
1,False,True,False
2,True,True,False


In [80]:
# Summing all the missing values in each columns
df_clean.isna().sum()

A    1
B    2
C    0
dtype: int64

In [81]:
# Dropping all the rows that have missing values
df_clean.dropna()

Unnamed: 0,A,B,C
0,1.0,5.0,1


In [82]:
# Keeping only the columns that have at least 2 non-null values (column B is dropped)
df_clean.dropna(axis=1, thresh=2)

Unnamed: 0,A,C
0,1.0,1
1,2.0,2
2,,3


In [83]:
# Filling all the missing values
df_clean.fillna(value='FILL')

Unnamed: 0,A,B,C
0,1,5,1
1,2,FILL,2
2,FILL,FILL,3


In [84]:
# Filling the missing data in a column with the mean value of this column
df_clean['A'].fillna(value=df_clean['A'].mean())

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64

In [85]:
# Renaming a specific column name
df_clean.rename(columns={'A':'X'})

Unnamed: 0,X,B,C
0,1.0,5.0,1
1,2.0,,2
2,,,3


In [86]:
# Replace all values equal to 1 with 'one'
df_clean.replace(1,'one')

Unnamed: 0,A,B,C
0,one,5.0,one
1,2,,2
2,,,3


In [87]:
# Convert the datatype of column 'A' to float
df_clean['A'].astype(float)

0    1.0
1    2.0
2    NaN
Name: A, dtype: float64

==========

### Data Manipulation

### Sorting, Grouping & Pivot Tables

| Method / Attribute                                         	| Description                                                                       	|
|------------------------------------------------------------	|-----------------------------------------------------------------------------------	|
| df[df[col] > 0.6]                                          	| Rows where the column col is greater than 0.6                                     	|
| df[(df[col] > 0.6) & (df[col] < 0.8)]                      	| Rows where 0.8 > col > 0.6                                                        	|
| df.sort_values(col1)                                       	| Sort values by col1 in ascending order                                            	|
| df.sort_values(col2,ascending=False)                       	| Sort values by col2 in descending order                                           	|
| df.sort_values([col1,col2],ascending=[True,False])         	| Sort values by col1 in ascending order then col2 in descending order              	|
| df.groupby(col)                                            	| Returns a groupby object for values from one column                               	|
| df.groupby([col1,col2])                                    	| Returns groupby object for values from multiple columns                           	|
| df.groupby(col1)[col2].mean()                              	| Returns the mean of the values in col2, grouped by the values in col1             	|
| df.pivot_table(index=col1,values=[col2,col3],aggfunc='mean') 	| Create a pivot table that groups by col1 and calculates the mean of col2 and col3 	|
| df.groupby(col1).agg(np.mean)                              	| Find the average across all columns for every unique col1 group                   	|
| df.apply(np.mean)                                          	| Apply the function np.mean() across each column                                   	|
| df.apply(np.max,axis=1)                                    	| Apply the function np.max() across each row                                       	|
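
As a quick illustration of how a few of these calls fit together, here is a minimal sketch. The `df_sales` DataFrame and its `Region` column are made up for this example and are not part of the dataset used below.

```python
import pandas as pd

df_sales = pd.DataFrame({'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT'],
                         'Region': ['East', 'West', 'East', 'West'],
                         'Sales': [120, 200, 340, 124]})

# Select the column first, then aggregate each group
mean_by_company = df_sales.groupby('Company')['Sales'].mean()

# One row per company, one column per region
pivot = df_sales.pivot_table(index='Company', columns='Region',
                             values='Sales', aggfunc='mean')

# Company ascending, then Sales descending within each company
ranked = df_sales.sort_values(['Company', 'Sales'], ascending=[True, False])

print(mean_by_company)
print(pivot)
print(ranked)
```

Selecting the column before aggregating keeps text columns out of the calculation.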

In [88]:
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}
df_group = pd.DataFrame(data)
df_group

Unnamed: 0,Company,Person,Sales
0,GOOG,Sam,200
1,GOOG,Charlie,120
2,MSFT,Amy,340
3,MSFT,Vanessa,124
4,FB,Carl,243
5,FB,Sarah,350


#### Sorting

In [89]:
# Sorting the values
df_group.sort_values('Sales')

Unnamed: 0,Company,Person,Sales
1,GOOG,Charlie,120
3,MSFT,Vanessa,124
0,GOOG,Sam,200
4,FB,Carl,243
2,MSFT,Amy,340
5,FB,Sarah,350


In [90]:
# Sorting the values descending
df_group.sort_values('Sales', ascending=False)

Unnamed: 0,Company,Person,Sales
5,FB,Sarah,350
2,MSFT,Amy,340
4,FB,Carl,243
0,GOOG,Sam,200
3,MSFT,Vanessa,124
1,GOOG,Charlie,120


In [91]:
# Sorting by the index labels
df_group.sort_index()

Unnamed: 0,Company,Person,Sales
0,GOOG,Sam,200
1,GOOG,Charlie,120
2,MSFT,Amy,340
3,MSFT,Vanessa,124
4,FB,Carl,243
5,FB,Sarah,350


#### Grouping

In [92]:
# Group rows by a column, then call an aggregate method on the groups
# numeric_only=True skips the text column 'Person' (needed in pandas 2.0 and later)
df_group.groupby('Company').mean(numeric_only=True)

Unnamed: 0_level_0,Sales
Company,Unnamed: 1_level_1
FB,296.5
GOOG,160.0
MSFT,232.0


In [93]:
# Also, you can get the summary statistics for each (Company, Person) pair
df_group.groupby(['Company','Person']).describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,Sales,Sales,Sales,Sales,Sales,Sales,Sales,Sales
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,std,min,25%,50%,75%,max
Company,Person,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
FB,Carl,1.0,243.0,,243.0,243.0,243.0,243.0,243.0
FB,Sarah,1.0,350.0,,350.0,350.0,350.0,350.0,350.0
GOOG,Charlie,1.0,120.0,,120.0,120.0,120.0,120.0,120.0
GOOG,Sam,1.0,200.0,,200.0,200.0,200.0,200.0,200.0
MSFT,Amy,1.0,340.0,,340.0,340.0,340.0,340.0,340.0
MSFT,Vanessa,1.0,124.0,,124.0,124.0,124.0,124.0,124.0


#### Pivot Tables

In [94]:
# Passing the aggregation by name ('max') avoids a warning in recent pandas versions
df_group.pivot_table(columns='Company', aggfunc='max')

Company,FB,GOOG,MSFT
Person,Sarah,Sam,Vanessa
Sales,350,200,340


==========

### Pandas Statistics

| Method / Attribute 	| Description                                                    	|
|--------------------	|----------------------------------------------------------------	|
| df.describe()      	| Summary statistics for numerical columns                       	|
| df.mean()          	| Returns the mean of all columns                                	|
| df.corr()          	| Returns the correlation between columns in a DataFrame         	|
| df.count()         	| Returns the number of non-null values in each DataFrame column 	|
| df.max()           	| Returns the highest value in each column                       	|
| df.min()           	| Returns the lowest value in each column                        	|
| df.median()        	| Returns the median of each column                              	|
| df.std()           	| Returns the standard deviation of each column                  	|
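
Most of these methods work column by column by default and accept `axis=1` to work row by row instead. A minimal sketch on made-up data (`df_stats` is a hypothetical name), which also shows what `corr()` returns for perfectly related columns:

```python
import pandas as pd

df_stats = pd.DataFrame({'x': [1, 2, 3, 4],
                         'y': [2, 4, 6, 8],
                         'z': [4, 3, 2, 1]})

col_means = df_stats.mean()         # one value per column
row_means = df_stats.mean(axis=1)   # one value per row

# Pairwise Pearson correlation between the columns
corr = df_stats.corr()

print(col_means)
print(row_means)
print(corr)
```

Here `y` is exactly `2 * x`, so their correlation is 1, while `z` decreases as `x` increases, so their correlation is -1.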

In [95]:
df

Unnamed: 0,W,X,Y,Z
A,0.516399,0.570668,0.028474,0.171522
B,0.685277,0.833897,0.306966,0.893613
C,0.721544,0.189939,0.554228,0.352132
D,0.181892,0.785602,0.965483,0.232354
E,0.083561,0.603548,0.728993,0.276239
F,0.685306,0.517867,0.048485,0.137869


In [96]:
df.describe()

Unnamed: 0,W,X,Y,Z
count,6.0,6.0,6.0,6.0
mean,0.478997,0.583587,0.438771,0.343955
std,0.279279,0.229481,0.377631,0.279798
min,0.083561,0.189939,0.028474,0.137869
25%,0.265519,0.531068,0.113105,0.18673
50%,0.600838,0.587108,0.430597,0.254296
75%,0.685299,0.740088,0.685301,0.333159
max,0.721544,0.833897,0.965483,0.893613


In [97]:
# Calculating the mean of a specific column
df['X'].mean()

0.5835868436769396

In [98]:
# Let's find the median of each row
df.median(axis=1)

A    0.343960
B    0.759587
C    0.453180
D    0.508978
E    0.439894
F    0.327868
dtype: float64

In [99]:
# Find the integer position of the min value of a specific column
df['W'].argmin()

4

In [100]:
# Find the row name of the min value of a specific column
df['W'].idxmin()

'E'

In [None]:
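# argmin() returns an integer position, so pair it with .iloc[]
df.iloc[df['W'].argmin()]
# idxmin() returns an index label, so pair it with .loc[]
df.loc[df['W'].idxmin()]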
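
Since `argmin()` gives a position and `idxmin()` gives a label, each one belongs with a different indexer. A minimal sketch on a made-up Series (`s_min` is a hypothetical name):

```python
import pandas as pd

s_min = pd.Series([0.5, 0.1, 0.7], index=['A', 'B', 'C'])

pos = s_min.argmin()     # integer position of the smallest value
label = s_min.idxmin()   # index label of the smallest value

# Position goes with .iloc[], label goes with .loc[]
value_by_position = s_min.iloc[pos]
value_by_label = s_min.loc[label]

print(pos, label, value_by_position, value_by_label)
```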

==========

## Case-study: Salaries EDA

This dataset can be found on Kaggle:
https://www.kaggle.com/kaggle/sf-salaries

#### Import pandas as pd

In [101]:
import pandas as pd

#### Read Salaries.csv as a dataframe

In [102]:
sal = pd.read_csv('data/Salaries.csv')

#### Check the head of the DataFrame

In [103]:
sal.head()

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,1,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,2,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,3,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
3,4,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
4,5,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,


#### Use the .info() method to find out how many entries there are

In [104]:
sal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Id                148654 non-null  int64  
 1   EmployeeName      148654 non-null  object 
 2   JobTitle          148654 non-null  object 
 3   BasePay           148045 non-null  float64
 4   OvertimePay       148650 non-null  float64
 5   OtherPay          148650 non-null  float64
 6   Benefits          112491 non-null  float64
 7   TotalPay          148654 non-null  float64
 8   TotalPayBenefits  148654 non-null  float64
 9   Year              148654 non-null  int64  
 10  Notes             0 non-null       float64
 11  Agency            148654 non-null  object 
 12  Status            0 non-null       float64
dtypes: float64(8), int64(2), object(3)
memory usage: 14.7+ MB


#### What is the average BasePay ?

In [105]:
sal['BasePay'].mean()

66325.44884050643

#### What is the highest amount of OvertimePay in the dataset ?

In [106]:
sal['OvertimePay'].max()

245131.88

#### What is the job title of JOSEPH DRISCOLL ? Note: Use all caps, otherwise you may get an answer that doesn't match up (there is also a lowercase Joseph Driscoll).

In [107]:
sal[sal['EmployeeName']=='JOSEPH DRISCOLL']['JobTitle']

24    CAPTAIN, FIRE SUPPRESSION
Name: JobTitle, dtype: object

#### How much does JOSEPH DRISCOLL make (including benefits)? 

In [108]:
sal[sal['EmployeeName']=='JOSEPH DRISCOLL']['TotalPayBenefits']

24    270324.91
Name: TotalPayBenefits, dtype: float64

#### What is the name of the highest paid person (including benefits)?

In [109]:
sal[sal['TotalPayBenefits']== sal['TotalPayBenefits'].max()]#['EmployeeName']
# or
# sal.loc[sal['TotalPayBenefits'].idxmax()]

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,1,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,


#### What is the name of the lowest paid person (including benefits)? Do you notice something strange about how much he or she is paid?

In [110]:
sal[sal['TotalPayBenefits']== sal['TotalPayBenefits'].min()] #['EmployeeName']
# or
# sal.loc[sal['TotalPayBenefits'].idxmin()]['EmployeeName']

# The total pay is negative, which is very strange!

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
148653,148654,Joe Lopez,"Counselor, Log Cabin Ranch",0.0,0.0,-618.13,0.0,-618.13,-618.13,2014,,San Francisco,


#### What was the average (mean) BasePay of all employees per year? (2011-2014) ?

In [111]:
# Select the column before aggregating so the text columns are not averaged
sal.groupby('Year')['BasePay'].mean()

Year
2011    63595.956517
2012    65436.406857
2013    69630.030216
2014    66564.421924
Name: BasePay, dtype: float64

#### How many unique job titles are there?

In [112]:
sal['JobTitle'].nunique()

2159

#### What are the top 5 most common jobs? 

In [113]:
sal['JobTitle'].value_counts().head(5)

Transit Operator                7036
Special Nurse                   4389
Registered Nurse                3736
Public Svc Aide-Public Works    2518
Police Officer 3                2421
Name: JobTitle, dtype: int64

#### How many Job Titles were represented by only one person in 2013? (e.g. Job Titles with only one occurrence in 2013?)

In [114]:
sum(sal[sal['Year']==2013]['JobTitle'].value_counts() == 1)
# pretty tricky way to do this...

202
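
The same counting idea can be written with a Boolean Series and `.sum()`. A minimal sketch on made-up rows (`jobs`, `counts`, and `single_titles` are hypothetical names, not the Salaries data):

```python
import pandas as pd

jobs = pd.DataFrame({'Year': [2013, 2013, 2013, 2014],
                     'JobTitle': ['Clerk', 'Clerk', 'Nurse', 'Pilot']})

# Count how often each title appears in 2013
counts = jobs.loc[jobs['Year'] == 2013, 'JobTitle'].value_counts()

# True counts as 1, so summing the mask counts the titles seen once
single_titles = int((counts == 1).sum())

print(single_titles)
```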

#### How many people have the word Chief in their job title? (This is pretty tricky)

In [115]:
def chief_string(title):
    # True if the title contains 'chief', ignoring case
    return 'chief' in title.lower()

In [116]:
sal['JobTitle'].apply(chief_string).sum()

627
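
A helper function is not the only option here. As a sketch of an alternative, the `.str` accessor can do the case-insensitive search directly; the `titles` Series below is made up for illustration and is not the Salaries data.

```python
import pandas as pd

titles = pd.Series(['CHIEF OF POLICE', 'Deputy Chief', 'Transit Operator', None])

# case=False ignores case, na=False treats missing titles as "no match"
is_chief = titles.str.contains('chief', case=False, na=False)
chief_count = int(is_chief.sum())

print(chief_count)
```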

#### Bonus: Is there a correlation between length of the Job Title string and Salary?

In [117]:
sal['title_len'] = sal['JobTitle'].apply(len)

In [118]:
sal[['title_len','TotalPayBenefits']].corr() # No correlation.

Unnamed: 0,title_len,TotalPayBenefits
title_len,1.0,-0.036878
TotalPayBenefits,-0.036878,1.0


==========

# THANK YOU!