<a href="https://colab.research.google.com/github/akdubey/AKDU/blob/main/Lecture_6b_Creating_a_Pandas_Dataframe%2C_Data_Selection_in_DataFrame%2C_Working_with_Missing_Values.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topics to be covered

* Introduction to DataFrames
* Creating a Pandas DataFrame
* Data Selection in DataFrame
* Working with Missing Values

#### A DataFrame is often used for analytical purposes and is better understood when thought of as column-oriented, where each column is a Series.

> * #### A Pandas DataFrame is a two-dimensional, array-like data structure with labeled rows and columns.
* #### You can think of a Pandas DataFrame as multiple Pandas Series that share the same index.
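
As a minimal sketch of the second point, two Series can be combined column-wise into a DataFrame. The names `students`, `marks`, and `frame` and the sample values are illustrative assumptions, not part of the lecture data.

```python
import pandas as pd

# Two Series with the same default index (hypothetical sample values)
students = pd.Series(['Jack', 'Mike', 'Rohan'], name='student')
marks = pd.Series([9.8, 6.7, 8.0], name='marks')

# Placing the Series side by side (axis=1) yields a DataFrame
frame = pd.concat([students, marks], axis=1)

print(frame.shape)           # two columns, three rows
print(type(frame['marks']))  # each column is again a Series
```

The Series names become the column names, which is why each Series is given a `name` here.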


# Import the Packages

In [1]:
import pandas as pd
import numpy as np

# Creating a Pandas DataFrame from a dictionary

In [2]:
# Dictionary
data={'student': ['Jack','Mike','Rohan','Zubair'], 'year':[1,2,3,1], 'marks':[9.8,6.7,8,9.9]}

In [3]:
df = pd.DataFrame(data)
df

  student  year  marks
0    Jack     1    9.8
1    Mike     2    6.7
2   Rohan     3    8.0
3  Zubair     1    9.9


### Note - each column is a pandas Series instance

In [4]:
# type of column
type(df['student'])

pandas.core.series.Series

In [5]:
# any operation that can be done to a series can be applied to a column
df['student'].str.upper()

0      JACK
1      MIKE
2     ROHAN
3    ZUBAIR
Name: student, dtype: object

# Construction of DataFrame

#### Data frames can be created from many types of input:
* columns (dicts of lists)
* rows (list of dicts)
* CSV file (pd.read_csv)
* from NumPy ndarray
* and more: SQL, HDF5, etc.

# Construction from columns (dicts of lists)

In [6]:
# Dictionary
data={'student': ['Jack','Mike','Rohan','Zubair'], 'year':[1,2,3,1], 'marks':[9.8,6.7,8,9.9]}
df = pd.DataFrame(data)
df

  student  year  marks
0    Jack     1    9.8
1    Mike     2    6.7
2   Rohan     3    8.0
3  Zubair     1    9.9


# Construction from rows (list of dicts)

In [7]:
#Example - 1
l = [{'student':'Jack','year':1,'marks':9.8},{'student':'Mike','year':2,'marks':6.7}]
df = pd.DataFrame(l)
df

  student  year  marks
0    Jack     1    9.8
1    Mike     2    6.7


In [8]:
#Example - 2
l = [{'student':'Jack','year':1,'marks':9.8},{'student':'Mike','year':2}]
df = pd.DataFrame(l)
df

  student  year  marks
0    Jack     1    9.8
1    Mike     2    NaN
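
In Example 2 the second dict has no `marks` key, so pandas leaves that cell missing (NaN) and stores the column as floating point. The sketch below checks this; `rows` and `frame` are illustrative names.

```python
import pandas as pd

# The second row deliberately omits the 'marks' key
rows = [{'student': 'Jack', 'year': 1, 'marks': 9.8},
        {'student': 'Mike', 'year': 2}]
frame = pd.DataFrame(rows)

# The missing key becomes a missing value in the 'marks' column
print(frame['marks'].isna().tolist())
print(frame['marks'].dtype)
```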


# Construction from CSV file

In [30]:
# df = pd.read_csv('DataSets/Salary_Data.csv')

In [14]:
# The os module provides functions for working with the operating system, such as creating and removing directories, listing their contents, and changing or identifying the current directory
import os

In [16]:
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [18]:
csv_path = '/content/drive/My Drive/Training/Salary_Data.csv'  # path to the CSV file on the mounted drive
df = pd.read_csv(csv_path)
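
Because the CSV file above lives on a private drive, here is a minimal self-contained sketch of `pd.read_csv` that reads from an in-memory string instead. The CSV content, the column names `Name` and `Salary`, and the variable names are assumptions made for illustration only.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file such as Salary_Data.csv
csv_text = "Name,Salary\nAlex,12000\nMaria,9500\n"

# read_csv accepts a path or any file-like object
frame = pd.read_csv(io.StringIO(csv_text))

print(frame.shape)
print(frame.columns.tolist())
```

The first line of the text is used as the header row, which is the default behaviour of `read_csv`.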

# Construction from NumPy ndarray

In [19]:
arr = np.random.randint(1,100,(10,5))
df = pd.DataFrame(arr,columns=['a','b','c','d','e'])
df

    a   b   c   d   e
0  21  86  90  80  90
1  68  98  33  14  21
2   6  67  56  76  22
3  37  77  46  90  30
4   8  92  71   1  27
5  34  16  20  92  35
6  96  32  75  32  11
7  86  39  79  64  20
8  26  40  95  95  44
9  50  56  84  34  48


# Data Frame Axis

* #### Unlike a series, which has one axis, there are two axes for a data frame.
* #### They are commonly referred to as axis 0 and 1, or the row axis and the columns axis respectively

In [21]:
df.axes

[RangeIndex(start=0, stop=10, step=1),
 Index(['a', 'b', 'c', 'd', 'e'], dtype='object')]

In [22]:
# it is important to remember that 0 is the index and 1 is the columns 
df.axes[0]

RangeIndex(start=0, stop=10, step=1)

In [23]:
df.axes[1]

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
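
The axis numbers matter most when passed to methods. As a small sketch (the frame and its values are made up for illustration), `axis=0` aggregates down the rows and gives one result per column, while `axis=1` aggregates across the columns and gives one result per row.

```python
import numpy as np
import pandas as pd

# A 2x3 frame holding the values 0..5
frame = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['a', 'b', 'c'])

col_totals = frame.sum(axis=0)  # collapse the row axis: one total per column
row_totals = frame.sum(axis=1)  # collapse the column axis: one total per row

print(col_totals.tolist())
print(row_totals.tolist())
```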

# Data Selection in DataFrame

In [25]:
df

    a   b   c   d   e
0  21  86  90  80  90
1  68  98  33  14  21
2   6  67  56  76  22
3  37  77  46  90  30
4   8  92  71   1  27
5  34  16  20  92  35
6  96  32  75  32  11
7  86  39  79  64  20
8  26  40  95  95  44
9  50  56  84  34  48


### The individual column (Series) of the DataFrame can be accessed via dictionary-style indexing of the column name.

In [31]:
# df['Name of column']
# name = df['Salary']
# name

In [32]:
# type(name)
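
A minimal sketch of dictionary-style indexing, assuming a small hypothetical table with `Name` and `Salary` columns (the `salaries` frame and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical salary table
salaries = pd.DataFrame({'Name': ['Alex', 'Maria', 'Ravi'],
                         'Salary': [12000, 9500, 15000]})

one_column = salaries['Salary']             # a single label returns a Series
two_columns = salaries[['Name', 'Salary']]  # a list of labels returns a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```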

### We can also use attribute-style access

# Note - 

Attribute-style access does not work in all cases. For example, it fails when the column names are not strings, or when a column name conflicts with a method of the DataFrame.
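
The following sketch illustrates the conflict case under stated assumptions: the `staff` frame, its values, and its column names (`Name`, `Salary`, `rank`) are hypothetical. Because `rank` is also a DataFrame method, attribute access returns the method rather than the column.

```python
import pandas as pd

# Hypothetical table with a column named like a DataFrame method
staff = pd.DataFrame({'Name': ['Alex', 'Maria'],
                      'Salary': [12000, 9500],
                      'rank': [3, 1]})

print(staff.Salary.tolist())   # attribute access works for 'Salary'
print(callable(staff.rank))    # 'rank' resolves to the method DataFrame.rank
print(staff['rank'].tolist())  # dictionary-style access still reaches the column
```

Dictionary-style indexing is therefore the safer default when column names are not fully under your control.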

In [34]:
df1  = pd.read_csv('Salary_Data1.csv')


In [35]:
df1


In [36]:
# rank is a DataFrame method, so df1.rank returns the method, not the 'rank' column
df1.rank


### We can also use the dictionary-style syntax above to add a new column

In [38]:
df1['Annual_Salary'] = df1['Salary']*12
df1

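
A runnable sketch of the same idea, assuming a hypothetical `staff` frame with a monthly `Salary` column (names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical table with monthly salaries
staff = pd.DataFrame({'Name': ['Alex', 'Maria'],
                      'Salary': [12000, 9500]})

# Assigning to a new label creates the column; the arithmetic is element-wise
staff['Annual_Salary'] = staff['Salary'] * 12

print(staff.columns.tolist())
print(staff['Annual_Salary'].tolist())
```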

### Access the DataFrame as a 2D array

In [39]:
df1.values


### We can transpose the full DataFrame to swap rows and columns

In [40]:
df1.T

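
As a small self-contained sketch of both operations (the `frame` and its values are illustrative assumptions), `.values` exposes the underlying NumPy array and `.T` swaps the row and column labels:

```python
import pandas as pd

# Hypothetical 3x2 frame
frame = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

arr = frame.values     # NumPy array with one row per DataFrame row
transposed = frame.T   # column labels become the index, and vice versa

print(arr.shape)
print(transposed.shape)
print(transposed.index.tolist())
```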

# Pandas also provides the .loc and .iloc indexers

In [41]:
df1.iloc[0:3]


In [42]:
df1.iloc[:3,:2]


In [43]:
df1.iloc[:,[0,2,4]]


In [44]:
df1.loc[:2,:'Salary']

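
The key difference between the two indexers is sketched below on a hypothetical `staff` frame (names and values are assumptions): `.iloc` selects by integer position and excludes the end of a slice, while `.loc` selects by label and includes it.

```python
import pandas as pd

# Hypothetical table with the default integer index 0..3
staff = pd.DataFrame({'Name': ['Alex', 'Maria', 'Ravi', 'Chen'],
                      'Salary': [12000, 9500, 15000, 11000],
                      'rank': [3, 1, 4, 2]})

by_position = staff.iloc[:2, :2]      # positions 0-1 on both axes (end excluded)
by_label = staff.loc[:2, :'Salary']   # labels 0..2 and Name..Salary (end included)

print(by_position.shape)
print(by_label.shape)
```

With the default index the labels happen to be integers, which is why `:2` means two rows for `.iloc` but three rows for `.loc`.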

### Selecting data using masking and fancy indexing

In [45]:
# Masking
df1[df1.Salary>11000]


In [46]:
# masking and fancy indexing

df1.loc[df1.Salary>11000,['Name','rank']]


In [None]:
# Update the rank of Alex
df1.loc[df1.Name=='Alex','rank']=10

In [47]:
df1.loc[df1.Name=='Alex']

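
A self-contained sketch of masking, fancy indexing, and a masked update, again on a hypothetical `staff` frame whose names and values are assumptions:

```python
import pandas as pd

# Hypothetical table
staff = pd.DataFrame({'Name': ['Alex', 'Maria', 'Ravi', 'Chen'],
                      'Salary': [12000, 9500, 15000, 11000],
                      'rank': [3, 1, 4, 2]})

# Masking: a Boolean Series, True where the condition holds
mask = staff['Salary'] > 11000

# Mask for the rows plus a list of labels for the columns
high_earners = staff.loc[mask, ['Name', 'rank']]

# A mask on the left-hand side updates only the matching rows
staff.loc[staff['Name'] == 'Alex', 'rank'] = 10

print(mask.tolist())
print(high_earners['Name'].tolist())
print(staff.loc[staff['Name'] == 'Alex', 'rank'].tolist())
```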

# Handling Missing Data

* Many interesting datasets will have some amount of data missing. 
* Different data sources may indicate missing data in different ways.
* In this section we’ll refer to missing data in general as null, NaN, or NA values.

* NaN ==> acronym for Not a Number

* It is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation


#### NaN and None both have their place, and Pandas is built to handle the two of them.


In [48]:
# Pandas automatically type-casts when NA values are present.
v1 = pd.Series([1, np.nan, 2, None])
v1

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64
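
Two consequences of NaN being a floating-point value are sketched below (variable names are illustrative): arithmetic with NaN yields NaN, and an integer Series is upcast to `float64` as soon as it contains a missing value.

```python
import numpy as np
import pandas as pd

# NaN propagates through arithmetic and is not equal to itself
total = 1 + np.nan
same = (np.nan == np.nan)

# Integers stay int64 until a missing value forces an upcast to float64
ints = pd.Series([1, 2, 3])
with_none = pd.Series([1, None, 3])

print(total, same)
print(ints.dtype, with_none.dtype)
```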

# Working on Null Values

There are several useful methods for detecting, removing, and replacing null values in Pandas data structures

* #### isnull() => Generate a Boolean mask indicating missing values
* #### notnull() => Opposite of isnull()
* #### dropna() => Return a filtered version of the data with missing values dropped
* #### fillna() => Return a copy of the data with missing values filled or imputed
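
As a quick sketch of the two detection methods, assuming a small hypothetical frame with `Name`, `Age`, and `Salary` columns (all names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical table with two missing entries
frame = pd.DataFrame({'Name': ['Alex', 'Maria', 'Ravi'],
                      'Age': [30, np.nan, 41],
                      'Salary': [12000, 9500, np.nan]})

# isnull() gives a Boolean mask; summing it counts missing values per column
missing_per_column = frame.isnull().sum()

# notnull() is the inverse mask; here it keeps only fully populated rows
complete_rows = frame[frame.notnull().all(axis=1)]

print(missing_per_column.to_dict())
print(complete_rows['Name'].tolist())
```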

# isnull()

In [49]:
data = pd.read_csv('Salary_Data_null.csv')
data


In [None]:
data.isnull()

In [None]:
data.isnull().sum()

# notnull()

In [50]:
data.notnull()


# dropna()

Note - We cannot drop single values from a DataFrame; we can only drop full rows or full columns.

In [51]:
# Drop by axis=0 (rows)
# By default, dropna() drops every row that contains at least one null value
data.dropna()


In [None]:
# Drop by axis =1 (column)
# you can drop NA values along a different axis; axis=1 
# drops all columns containing a null value
data.dropna(axis=1)

Note - But this drops some good data as well; you might rather be interested in dropping rows or columns with all NA values, or a majority of NA values

In [52]:
data.dropna(how='all')


In [None]:
data.drop(columns='Age')

In [None]:
data.dropna(thresh=3)
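
The `dropna()` variants above can be compared side by side on a small hypothetical frame (names and values are assumptions chosen so that each variant gives a different result):

```python
import numpy as np
import pandas as pd

# Row 0 is complete, rows 1 and 3 have one gap each, row 2 is entirely missing
frame = pd.DataFrame({'Name': ['Alex', 'Maria', None, 'Ravi'],
                      'Age': [30, np.nan, np.nan, 41],
                      'Salary': [12000, 9500, np.nan, np.nan]})

any_null_rows_dropped = frame.dropna()          # keeps only fully populated rows
any_null_cols_dropped = frame.dropna(axis=1)    # drops every column that has a gap
all_null_rows_dropped = frame.dropna(how='all') # drops only rows that are entirely null
at_least_two_values = frame.dropna(thresh=2)    # keeps rows with >= 2 non-null values

print(any_null_rows_dropped.shape)
print(any_null_cols_dropped.shape)
print(all_null_rows_dropped.shape)
print(at_least_two_values.shape)
```

`thresh` counts non-null values, so a higher threshold is stricter.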

# fillna()

Sometimes rather than dropping NA values, you’d rather replace them with a valid value.

In [53]:
data


In [54]:
# Fill with 0

data.fillna(0)

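
A minimal sketch of filling with a constant versus imputing with a column statistic, assuming a hypothetical frame with an `Age` column (names and values are illustrative). Using the mean is just one possible imputation choice, not a recommendation from the lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing age
frame = pd.DataFrame({'Name': ['Alex', 'Maria', 'Ravi'],
                      'Age': [30, np.nan, 40]})

zero_filled = frame.fillna(0)                            # same constant everywhere
mean_filled = frame.fillna({'Age': frame['Age'].mean()}) # per-column fill value

print(zero_filled['Age'].tolist())
print(mean_filled['Age'].tolist())
print(frame['Age'].isna().sum())  # the original frame is left unchanged
```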

In [55]:
# forward-fill to propagate the previous value forward
# (fillna(method='ffill') is deprecated in recent pandas versions)
data.ffill()


# Note - 

Notice that if no previous value is available during a forward fill, the NA value remains.
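
This behaviour can be seen on a short hypothetical Series (values are made up): the leading NaN has nothing before it, so a forward fill leaves it in place. Chaining a constant fill afterwards is one possible way to handle the leftover, shown here only as an example.

```python
import numpy as np
import pandas as pd

# The first value is missing, so there is nothing to propagate into it
s = pd.Series([np.nan, 1.0, np.nan, 3.0])

forward = s.ffill()                  # the gap at position 2 takes the value before it
forward_then_zero = forward.fillna(0)  # optional fallback for the leading gap

print(forward.isna().tolist())
print(forward_then_zero.tolist())
```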

In [56]:
# back-fill to propagate the next values backward
# (fillna(method='bfill') is deprecated in recent pandas versions)
data.bfill()
