In [1]:
import numpy as np
import pandas as pd

**pandas**
- pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language

- pandas is well suited for many different kinds of data:

    - Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet

    - Ordered and unordered (not necessarily fixed-frequency) time series data.

    - Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels

    - Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a pandas data structure



**Data Structures in Pandas**

- Series :
    - A pandas Series is a one-dimensional labeled array capable of holding any data type. In terms of pandas data structures, a Series represents a single column in memory, which is either independent or belongs to a pandas DataFrame. The object supports both integer and label-based indexing and provides a host of methods for performing operations involving the index.
    
- DataFrame :
    - A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes known as rows and columns. A DataFrame is a collection of Series that can be used to analyse data. A DataFrame can be created from lists, a dictionary, a list of dictionaries, and so on.



#### Series

**Creating a Series Using Default Indexing**

In [7]:
data = np.array(['a','b','c',99])

s = pd.Series(data)

s

0     a
1     b
2     c
3    99
dtype: object

**Creating a Series Using Custom Indexing**

In [8]:
data = np.array(['a','b','c','d'])

s = pd.Series(data, index=[101,102,103,108])
s

101    a
102    b
103    c
108    d
dtype: object

**Creating a Series from Dictionary**

In [10]:
data = {'a':1,'b':2,'c':3.0}
s = pd.Series(data)
s

a    1.0
b    2.0
c    3.0
dtype: float64

**Accessing Data from Series Using Position**

In [12]:
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])

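# .iloc selects by position; plain s[0] on a string-labeled Series is deprecated in recent pandas
print(s.iloc[0])
print(s.iloc[3])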

1
4


**Accessing Data from Series Using Index labels**

In [16]:
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])

# retrieve single elements by their index labels
print(s['a'])
print(s['e'])
print(s['d'])

1
5
4
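
Because a Series can be addressed both by label and by position, it can help to make the choice explicit. The following is a minimal sketch using the standard `.loc` (label) and `.iloc` (position) accessors; the variable names are only illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['d']          # label-based lookup
by_position = s.iloc[3]        # position-based lookup (0-indexed)
label_slice = s.loc['b':'d']   # label slices include both endpoints
pos_slice = s.iloc[1:3]        # positional slices exclude the stop position

print(by_label, by_position)
print(label_slice.tolist(), pos_slice.tolist())
```

Note that the label slice returns three elements while the positional slice returns two, since only positional slicing excludes its stop value.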


**Accessing the First 3 Elements**

In [17]:
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
print(s[:3])

a    1
b    2
c    3
dtype: int64


In [18]:
print(s[::-1])

e    5
d    4
c    3
b    2
a    1
dtype: int64


**Accessing Multiple Elements using Index Label**

In [19]:
s[['a','c','d']]

a    1
c    3
d    4
dtype: int64

**DataFrame**
- A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
- In general, a pandas DataFrame consists of three main components: the data, the index, and the columns. DataFrames are extremely important going forward, as we can read Excel sheets into DataFrames, store DataFrames back to files, and use many manipulation techniques on them, as we'll learn ahead.


**Creating Dataframe using list**


In [21]:
lis = ['usa','covid','cases','are', 'counting']

df= pd.DataFrame(lis)
df

Unnamed: 0,0
0,usa
1,covid
2,cases
3,are
4,counting


**Creating dataframe from dict of ndarray/lists**

In [20]:
dict1 = {
    "name":['harry','rohan','skillf','shubh'],
    "marks":[92,34,24,17],
    "city":["rampur",'kolkata','bareilly','antarctica']
}

In [5]:
df = pd.DataFrame(dict1)

df

Unnamed: 0,name,marks,city
0,harry,92,rampur
1,rohan,34,kolkata
2,skillf,24,bareilly
3,shubh,17,antarctica


**Create a DataFrame from List of Dicts**

In [22]:
data = [{'a':1,'b':2},{'a':5,'b':10,'c':20}]

df=pd.DataFrame(data)
df

Unnamed: 0,a,b,c
0,1,2,
1,5,10,20.0


In [23]:
# with two column indices, values same as dictionary keys

df1 = pd.DataFrame(data,index=['first','second'],columns=['a','b'])

df1

Unnamed: 0,a,b
first,1,2
second,5,10


**Column Selection**

In [24]:
df1.a

first     1
second    5
Name: a, dtype: int64

In [26]:
dict1 = {
    "name":['harry','rohan','skillf','shubh'],
    "marks":[92,34,24,17],
    "city":["rampur",'kolkata','bareilly','antarctica']
}
df = pd.DataFrame(dict1)

print(df.city)

print(df.name)

0        rampur
1       kolkata
2      bareilly
3    antarctica
Name: city, dtype: object
0     harry
1     rohan
2    skillf
3     shubh
Name: name, dtype: object


**Creating empty Dataframe**

In [5]:
df=pd.DataFrame()
df=pd.DataFrame(columns=["col1","col2"])
df

Unnamed: 0,col1,col2


In [7]:

data = {'Name':['POD1', 'POD2', 'POD3', 'POD4'],
        'Members':[20, 21, 19, 18]}
 
# Create DataFrame
df = pd.DataFrame(data)
 
# Print the output.
print(df)

   Name  Members
0  POD1       20
1  POD2       21
2  POD3       19
3  POD4       18


**Column Selection in DataFrame**

In [8]:
df[['Name']]

Unnamed: 0,Name
0,POD1
1,POD2
2,POD3
3,POD4


**Multiple Column Selection in a DataFrame**

In [9]:
df[['Name',"Members"]]

Unnamed: 0,Name,Members
0,POD1,20
1,POD2,21
2,POD3,19
3,POD4,18
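
The selections above use double brackets, which always return a DataFrame. As a small sketch of the difference, single brackets with one column name return a Series instead; the variable names here are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['POD1', 'POD2', 'POD3', 'POD4'],
                   'Members': [20, 21, 19, 18]})

as_series = df['Name']      # single brackets: one column as a Series
as_frame = df[['Name']]     # double brackets: a one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```

Which form to use depends on what the next step expects: many plotting and modelling functions want a two-dimensional object, while element-wise string or numeric methods work on a Series.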


**Row Selection in a DataFrame**

In [11]:
# Row Selection

# retrieving row by iloc method

# Retrieving the row at 0th index
df.iloc[0]

Name       POD1
Members      20
Name: 0, dtype: object

In [12]:
#retrieving the row at last index
df.iloc[-1]

Name       POD4
Members      18
Name: 3, dtype: object

**Slicing a DataFrame by Row Position**

In [13]:
df.iloc[0:2]

Unnamed: 0,Name,Members
0,POD1,20
1,POD2,21


In [14]:
df.iloc[-2:]

Unnamed: 0,Name,Members
2,POD3,19
3,POD4,18
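
`iloc` works on positions, while `loc` works on index labels. The two coincide for the default 0, 1, 2, ... index, so the sketch below assigns hypothetical string labels ('w' to 'z') to make the difference visible.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['POD1', 'POD2', 'POD3', 'POD4'],
                   'Members': [20, 21, 19, 18]},
                  index=['w', 'x', 'y', 'z'])  # hypothetical row labels

first_two_by_position = df.iloc[0:2]    # rows at positions 0 and 1
first_two_by_label = df.loc['w':'x']    # label slice includes 'x'
members_of_y = df.loc['y', 'Members']   # single cell by row and column label

print(first_two_by_position)
print(members_of_y)
```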


**Dropping Column or Rows in a DataFrame**

In [15]:
# Dropping columns in a DataFrame
# axis=1 refers to columns and axis=0 refers to rows
df.drop(['Name'],axis=1)

Unnamed: 0,Members
0,20
1,21
2,19
3,18


In [16]:
# Dropping rows in a DataFrame, using the index label of the row.
# axis=1 refers to columns and axis=0 refers to rows.
df.drop(0,axis=0)

Unnamed: 0,Name,Members
1,POD2,21
2,POD3,19
3,POD4,18
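
Note that `drop` returns a new DataFrame and leaves the original untouched, which is why `df` still has all its rows and columns in later cells. A minimal sketch of this behaviour, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['POD1', 'POD2', 'POD3', 'POD4'],
                   'Members': [20, 21, 19, 18]})

without_name = df.drop(columns=['Name'])   # same as df.drop(['Name'], axis=1)
without_first_row = df.drop(index=0)       # same as df.drop(0, axis=0)

print(list(df.columns))            # the original still has both columns
print(list(without_name.columns))
print(without_first_row.index.tolist())
```

To keep the result, assign it back to a name (for example `df = df.drop(columns=['Name'])`).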


**Transpose a Dataframe**

In [17]:
print(df)

df.T

   Name  Members
0  POD1       20
1  POD2       21
2  POD3       19
3  POD4       18


Unnamed: 0,0,1,2,3
Name,POD1,POD2,POD3,POD4
Members,20,21,19,18


**Extracting Columns of a DataFrame**

In [19]:
df.columns

Index(['Name', 'Members'], dtype='object')

**Getting Datatypes of Columns in a Dataframe**

In [21]:
df.dtypes

Name       object
Members     int64
dtype: object

**Introduction to Reading and Saving Dataframes**
- csv
- excel

**Value Distribution of DataFrame**

In [23]:
df.value_counts()

Name  Members
POD1  20         1
POD2  21         1
POD3  19         1
POD4  18         1
dtype: int64

### Reading and Saving Dataframes

**csv**
- A comma-separated values (CSV) file is a plaintext file with a .csv extension that holds tabular data. This is one of the most popular file formats for storing large amounts of data.
- Each row of the CSV file represents a single table row. The values in the same row are by default separated with commas, but you could change the separator to a semicolon, tab, space, or some other character.

<img src="https://www.mathworks.com/help/simulink/ug/sdi_import_csv_basic.png"/>

**Excel**
<br>
Excel is a popular spreadsheet program used to work with data such as numbers, formulas, text, and drawing shapes.<br>
XLS files use the Binary Interchange File Format (BIFF) to store spreadsheet data and are proprietary to Microsoft.

<img src="https://static.spreadsheetweb.com/ssweb/wp-content/uploads/2021/01/How-to-avoid-formatting-change-on-CSV-files-in-Excel-02.png"/>

**Write a CSV File**
- Use `to_csv()` to save a DataFrame to a CSV file

In [28]:
dict1 = {
    "name":['harry','rohan','skillf','shubh'],
    "marks":[92,34,24,17],
    "city":["rampur",'kolkata','bareilly','antarctica']
}
df = pd.DataFrame(dict1)

In [31]:
df

Unnamed: 0,name,marks,city
0,harry,92,rampur
1,rohan,34,kolkata
2,skillf,24,bareilly
3,shubh,17,antarctica


In [32]:
df.to_csv('friends.csv')

In [24]:
#create a data dictionary

data = {
    'CHN': {'COUNTRY': 'China', 'POP': 1_398.72, 'AREA': 9_596.96,
            'GDP': 12_234.78, 'CONT': 'Asia'},
    'IND': {'COUNTRY': 'India', 'POP': 1_351.16, 'AREA': 3_287.26,
            'GDP': 2_575.67, 'CONT': 'Asia', 'IND_DAY': '1947-08-15'},
    'USA': {'COUNTRY': 'US', 'POP': 329.74, 'AREA': 9_833.52,
            'GDP': 19_485.39, 'CONT': 'N.America',
            'IND_DAY': '1776-07-04'},
    'IDN': {'COUNTRY': 'Indonesia', 'POP': 268.07, 'AREA': 1_910.93,
            'GDP': 1_015.54, 'CONT': 'Asia', 'IND_DAY': '1945-08-17'},
    'BRA': {'COUNTRY': 'Brazil', 'POP': 210.32, 'AREA': 8_515.77,
            'GDP': 2_055.51, 'CONT': 'S.America', 'IND_DAY': '1822-09-07'},
    'PAK': {'COUNTRY': 'Pakistan', 'POP': 205.71, 'AREA': 881.91,
            'GDP': 302.14, 'CONT': 'Asia', 'IND_DAY': '1947-08-14'},
    
}

columns = ('COUNTRY', 'POP', 'AREA', 'GDP', 'CONT', 'IND_DAY')

In [27]:
# creating a DataFrame
# the data is organized in such a way that the country codes correspond to columns
# to swap (transpose) the rows and columns of a DataFrame, use the .T attribute
df = pd.DataFrame(data)
df

Unnamed: 0,CHN,IND,USA,IDN,BRA,PAK
COUNTRY,China,India,US,Indonesia,Brazil,Pakistan
POP,1398.72,1351.16,329.74,268.07,210.32,205.71
AREA,9596.96,3287.26,9833.52,1910.93,8515.77,881.91
GDP,12234.78,2575.67,19485.39,1015.54,2055.51,302.14
CONT,Asia,Asia,N.America,Asia,S.America,Asia
IND_DAY,,1947-08-15,1776-07-04,1945-08-17,1822-09-07,1947-08-14


In [29]:
df=df.T
df

Unnamed: 0,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14


In [33]:
df.to_csv('data.csv')

# pass index=False to leave the index (0,1,2... or whatever the index is) out of the csv file

**Without Removing the Index**

In [34]:
df=pd.read_csv('data.csv')
df

Unnamed: 0.1,Unnamed: 0,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
0,0,China,1398.72,9596.96,12234.78,Asia,
1,1,India,1351.16,3287.26,2575.67,Asia,1947-08-15
2,2,US,329.74,9833.52,19485.39,N.America,1776-07-04
3,3,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
4,4,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
5,5,Pakistan,205.71,881.91,302.14,Asia,1947-08-14


**With Removed Index**

In [46]:
df=pd.DataFrame(data).T

df.to_csv('data.csv',index=False)

df=pd.read_csv('data.csv')
df

Unnamed: 0,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
0,China,1398.72,9596.96,12234.78,Asia,
1,India,1351.16,3287.26,2575.67,Asia,1947-08-15
2,US,329.74,9833.52,19485.39,N.America,1776-07-04
3,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
4,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
5,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
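
Dropping the index is one option; another is to keep it in the file and tell `read_csv` which column holds it. The sketch below assumes a reduced two-row version of the country table and writes to an in-memory buffer (`io.StringIO`) instead of a real file, purely to keep the example self-contained.

```python
import io
import pandas as pd

df = pd.DataFrame({'COUNTRY': ['China', 'India'],
                   'POP': [1398.72, 1351.16]},
                  index=['CHN', 'IND'])

buffer = io.StringIO()
df.to_csv(buffer)       # the index is written as the first, unnamed column
buffer.seek(0)          # rewind so the buffer can be read from the start

restored = pd.read_csv(buffer, index_col=0)  # use the first column as the index
print(restored)
```

With `index_col=0` the country codes come back as the index, so no extra `Unnamed: 0` column appears.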


### To save any pandas DataFrame as an Excel (.xlsx) file

In [47]:
df.to_excel('data.xlsx')

In [52]:
# reading an excel file

df=pd.read_excel('data.xlsx',index_col=0)
df

Unnamed: 0,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
0,China,1398.72,9596.96,12234.78,Asia,
1,India,1351.16,3287.26,2575.67,Asia,1947-08-15
2,US,329.74,9833.52,19485.39,N.America,1776-07-04
3,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
4,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
5,Pakistan,205.71,881.91,302.14,Asia,1947-08-14


### DataFrame Operations

In [53]:
import pandas as pd

In [54]:
df= pd.DataFrame({'Region':['West','North','South'],
                   'Company':['Costco','Walmart','Home Depot'],
                   'Product':['Dinner Set','Grocery','Gardening tools'],
                   'Month':['September','July','February'],
                   'Sales':[2500,3096,8795]})
df

Unnamed: 0,Region,Company,Product,Month,Sales
0,West,Costco,Dinner Set,September,2500
1,North,Walmart,Grocery,July,3096
2,South,Home Depot,Gardening tools,February,8795


In [57]:
# New data row for the East region:
# This is a list holding one dictionary with the values for the East region that we want to add to the above DataFrame df.

data = [{'Region':'East','Company':'Shop Rite','Product':'Fruits','Month':'December','Sales': 1265}]


### adding a row using the concat function

In [56]:
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
pd.concat([df, pd.DataFrame(data)], ignore_index=True, sort=False)

Unnamed: 0,Region,Company,Product,Month,Sales
0,West,Costco,Dinner Set,September,2500
1,North,Walmart,Grocery,July,3096
2,South,Home Depot,Gardening tools,February,8795
3,East,Shop Rite,Fruits,December,1265


### Adding Rows to the DataFrame using loc

In [58]:
df.loc[3] = list(data[0].values())
df

Unnamed: 0,Region,Company,Product,Month,Sales
0,West,Costco,Dinner Set,September,2500
1,North,Walmart,Grocery,July,3096
2,South,Home Depot,Gardening tools,February,8795
3,East,Shop Rite,Fruits,December,1265


In [60]:
# using iloc to overwrite the row at index position 1

df.iloc[1]=list(data[0].values())
df

Unnamed: 0,Region,Company,Product,Month,Sales
0,West,Costco,Dinner Set,September,2500
1,East,Shop Rite,Fruits,December,1265
2,South,Home Depot,Gardening tools,February,8795
3,East,Shop Rite,Fruits,December,1265


### Adding a new Column to a DataFrame

In [66]:
purchase=[3000,400,3500,6000]
df.assign(Purchase=purchase)


Unnamed: 0,Region,Company,Product,Month,Sales,Purchase
0,West,Costco,Dinner Set,September,2500,3000
1,East,Shop Rite,Fruits,December,1265,400
2,South,Home Depot,Gardening tools,February,8795,3500
3,East,Shop Rite,Fruits,December,1265,6000


### Let's add these three lists (Date, City, Purchase) as columns to the existing DataFrame using assign with a dict of column names and values

In [67]:
Date=['1/9/2017','2/6/2018','7/12/2018','9/12/2018']
City = ['SFO', 'Chicago', 'Charlotte','denmark']
Purchase = [3000, 4000, 3500,5000]
      
df.assign(**{'City':City,'Date':Date,'Purchase':Purchase})


Unnamed: 0,Region,Company,Product,Month,Sales,City,Date,Purchase
0,West,Costco,Dinner Set,September,2500,SFO,1/9/2017,3000
1,East,Shop Rite,Fruits,December,1265,Chicago,2/6/2018,4000
2,South,Home Depot,Gardening tools,February,8795,Charlotte,7/12/2018,3500
3,East,Shop Rite,Fruits,December,1265,denmark,9/12/2018,5000
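
Like `drop`, `assign` returns a new DataFrame rather than changing `df` in place. The sketch below, on a reduced two-row table with an illustrative `Margin` column that is not part of the notebook's data, shows this and also that `assign` accepts a function that computes a column from the others.

```python
import pandas as pd

df = pd.DataFrame({'Region': ['West', 'North'],
                   'Sales': [2500, 3096]})

# assign returns a copy with the extra column; df itself is unchanged
with_purchase = df.assign(Purchase=[3000, 4000])

# a callable receives the DataFrame and returns the new column's values
with_margin = with_purchase.assign(Margin=lambda d: d['Sales'] - d['Purchase'])

print(list(df.columns))
print(with_margin)
```

To keep the new columns, assign the result back, for example `df = df.assign(Purchase=[3000, 4000])`.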


## Deleting Rows/Columns.
### Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names.
### Syntax: DataFrame.drop(labels=None, axis=0, index=None, columns=None,level=None,inplace=False, errors='raise') 

In [68]:
# using the above data frame for further operations

df

Unnamed: 0,Region,Company,Product,Month,Sales
0,West,Costco,Dinner Set,September,2500
1,East,Shop Rite,Fruits,December,1265
2,South,Home Depot,Gardening tools,February,8795
3,East,Shop Rite,Fruits,December,1265


In [70]:

# drop columns

df.drop(['Region','Company'],axis=1)

Unnamed: 0,Product,Month,Sales
0,Dinner Set,September,2500
1,Fruits,December,1265
2,Gardening tools,February,8795
3,Fruits,December,1265


In [72]:
df.drop(columns=['Region','Company'])

Unnamed: 0,Product,Month,Sales
0,Dinner Set,September,2500
1,Fruits,December,1265
2,Gardening tools,February,8795
3,Fruits,December,1265


In [73]:
# drop rows by index label

df.drop([0,1])

Unnamed: 0,Region,Company,Product,Month,Sales
2,South,Home Depot,Gardening tools,February,8795
3,East,Shop Rite,Fruits,December,1265


### Sorting (ascending/descending)
#### The sort_values() function returns a sorted dataframe

In [77]:
import numpy as np

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F']
})

df

Unnamed: 0,col1,col2,col3,col4
0,A,2,0,a
1,A,1,1,B
2,B,9,9,c
3,,8,4,D
4,D,7,2,e
5,C,4,3,F


In [78]:
# sorted according to column 1 (ascending)

df.sort_values(by=['col1'])

Unnamed: 0,col1,col2,col3,col4
0,A,2,0,a
1,A,1,1,B
2,B,9,9,c
5,C,4,3,F
4,D,7,2,e
3,,8,4,D


In [79]:
# sorted according to column 1 (descending)

df.sort_values(by='col1',ascending=False)

Unnamed: 0,col1,col2,col3,col4
4,D,7,2,e
5,C,4,3,F
2,B,9,9,c
0,A,2,0,a
1,A,1,1,B
3,,8,4,D
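
In both outputs above the row with the missing `col1` value comes last, because `sort_values` places NaN at the end by default. A short sketch of two related options, sorting by more than one column and moving missing values to the front with `na_position`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
})

# col1 ascending, ties broken by col2 descending, NaN rows first
sorted_df = df.sort_values(by=['col1', 'col2'],
                           ascending=[True, False],
                           na_position='first')
print(sorted_df)
```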


### NULL Handling/Checking

#### First, we will create a DataFrame in which one of the values is missing (null)

In [83]:
import pandas as pd

# initialize list of lists

data = [['Adam',10],['Steve',15],['John',]] # no age is given for John, so it becomes a missing value (NaN)


#creating DataFrame
df=pd.DataFrame(data,columns=["Name",'age'])
df

Unnamed: 0,Name,age
0,Adam,10.0
1,Steve,15.0
2,John,


### isna()
In this example, we use the isna() function to check for missing values. It returns True for each cell of the DataFrame that contains a missing value and False for every other cell.

In [84]:
df.isna()

Unnamed: 0,Name,age
0,False,False
1,False,False
2,False,True


In [85]:

#creating the series
ser= pd.Series([12,5,None,5,None,11])

ser

0    12.0
1     5.0
2     NaN
3     5.0
4     NaN
5    11.0
dtype: float64

In [86]:
ser.isna()

0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool
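
The boolean result of `isna()` is often summarised rather than read cell by cell. A minimal sketch, rebuilding the small Name/age table so the block runs on its own:

```python
import pandas as pd

df = pd.DataFrame([['Adam', 10], ['Steve', 15], ['John', None]],
                  columns=['Name', 'age'])

missing_per_column = df.isna().sum()           # True counts as 1, so this counts missing cells
rows_with_missing = df[df.isna().any(axis=1)]  # rows having at least one missing value

print(missing_per_column)
print(rows_with_missing)
```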

## notna()
The notna() function is the inverse of isna(): it returns True for each cell that holds a value and False for each cell where an NA value is encountered.

In [87]:
df.notna()

Unnamed: 0,Name,age
0,True,True
1,True,True
2,True,False


It returned False only for the Age of John (which was the only missing value in the dataframe).

### notnull()

notnull() is an alias of notna() and works the same way.

In [88]:
df

Unnamed: 0,Name,age
0,Adam,10.0
1,Steve,15.0
2,John,


In [89]:
df.notnull()

Unnamed: 0,Name,age
0,True,True
1,True,True
2,True,False


### replace()

In [90]:
df

Unnamed: 0,Name,age
0,Adam,10.0
1,Steve,15.0
2,John,


In [92]:
# this will replace "John" with "Michael"

df.replace(to_replace="John", value="Michael")

Unnamed: 0,Name,age
0,Adam,10.0
1,Steve,15.0
2,Michael,


In [93]:
# initialise data of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack','steve','David','Adam'],
        'Subject':['Maths','Bio','Phy','Bio','Maths','Bio','Phy']}

dff = pd.DataFrame(data)

dff

Unnamed: 0,Name,Subject
0,Tom,Maths
1,nick,Bio
2,krish,Phy
3,jack,Bio
4,steve,Maths
5,David,Bio
6,Adam,Phy


In [96]:

# Using a Python list as the argument, we replace "Phy" and "Bio" with "Science" in the DataFrame.
dff.replace(to_replace=["Bio",'Phy'], value='Science')

Unnamed: 0,Name,Subject
0,Tom,Maths
1,nick,Science
2,krish,Science
3,jack,Science
4,steve,Maths
5,David,Science
6,Adam,Science


In [95]:
df

Unnamed: 0,Name,age
0,Adam,10.0
1,Steve,15.0
2,John,


### Replace NaN with value

In [108]:
import numpy as np

# replace missing value (age of John) with 12.0
# use numpy.nan for NaN values

df.replace(to_replace=np.nan,value=12.0)

Unnamed: 0,Name,age
0,Adam,10.0
1,Steve,15.0
2,John,12.0


## fillna

fillna() fills the missing (NA) values with a value that you supply.

In [110]:
df.replace(to_replace=12.0,value=np.nan)

Unnamed: 0,Name,age
0,Adam,10.0
1,Steve,15.0
2,John,


In [112]:

# filling missing values with the mean of age (the column is named 'age', in lower case)

df['age'].fillna(df['age'].mean())
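
The mean-fill idea can be sketched end to end as follows. The table is rebuilt here so the block is self-contained; note that `fillna` returns a new Series, so the result is assigned back to the column.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Adam', 'Steve', 'John'],
                   'age': [10.0, 15.0, None]})

mean_age = df['age'].mean()              # mean() skips missing values by default
df['age'] = df['age'].fillna(mean_age)   # assign back to keep the filled column

print(mean_age)
print(df)
```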

# Groupby


## Why/When Groupby?
During your EDA, data analysis, or feature engineering there will usually come a point where you want to split the data based on certain groups or categories and get relevant statistical inferences from each group.

Let's see some scenarios below. Let's start by making some dummy data.


In [2]:
import numpy as np
import pandas as pd

In [10]:
"""
  Let's start with the assumption that there is a class of 300 students and each of them has given their one favourite subject,
  out of 6 unique subjects, a rating in the range 1-10. (n_studs is set to 10 below to keep the output short.)
"""


n_studs=10
HouseOne =pd.DataFrame({
    "Name":["Name_" + str(i) for i in range(n_studs)],
    "Subject":np.random.choice(["Subject_"+str(i) for i in range(6)],size=n_studs),
    "Rating":np.random.uniform(low=1,high=10,size=n_studs),
    "Num":np.random.randint(low=1,high=10,size=n_studs)
})

**sample(n)** returns n randomly chosen rows in random order

In [11]:
HouseOne.sample(10) # will give you 10 random rows in any order

Unnamed: 0,Name,Subject,Rating,Num
6,Name_6,Subject_4,9.032693,9
1,Name_1,Subject_4,1.325285,6
8,Name_8,Subject_2,9.623728,8
2,Name_2,Subject_5,8.32396,5
7,Name_7,Subject_4,4.98911,5
0,Name_0,Subject_0,3.275702,8
4,Name_4,Subject_3,3.778605,8
3,Name_3,Subject_4,8.445388,4
9,Name_9,Subject_2,3.788089,5
5,Name_5,Subject_4,5.167409,2


OK, now that we have some sample data, let's see what average rating each subject has.

In [12]:
"""
Hold on a second: notice that groupby("Subject") splits your data into one group per unique subject and returns a GroupBy object.
If you iterate over this object, you get tuples of (group name, filtered data); here the group name is the subject name.
"""

for name,group in HouseOne.groupby("Subject"):
    print(f"Group Name is {name}")

Group Name is Subject_0
Group Name is Subject_2
Group Name is Subject_3
Group Name is Subject_4
Group Name is Subject_5


In [14]:
%%time
for name,group in HouseOne.groupby("Subject"):
    print(f"Subject Name is {name} and Subject Avg. rating is {group['Rating'].mean()}")
    # now we know groups represent the  subjects

Subject Name is Subject_0 and Subject Avg. rating is 3.2757020812192197
Subject Name is Subject_2 and Subject Avg. rating is 6.705908575486953
Subject Name is Subject_3 and Subject Avg. rating is 3.7786050651566807
Subject Name is Subject_4 and Subject Avg. rating is 5.791976932567244
Subject Name is Subject_5 and Subject Avg. rating is 8.32395956700491
Wall time: 2.99 ms


In [15]:

for name,group in HouseOne.groupby("Subject"):
    print(f"Now Printing Filtered Data of only : {name}")
    print("*"*50)
    print("*"*50)
    print(group.head(3))
    print("*"*50)

Now Printing Filtered Data of only : Subject_0
**************************************************
**************************************************
     Name    Subject    Rating  Num
0  Name_0  Subject_0  3.275702    8
**************************************************
Now Printing Filtered Data of only : Subject_2
**************************************************
**************************************************
     Name    Subject    Rating  Num
8  Name_8  Subject_2  9.623728    8
9  Name_9  Subject_2  3.788089    5
**************************************************
Now Printing Filtered Data of only : Subject_3
**************************************************
**************************************************
     Name    Subject    Rating  Num
4  Name_4  Subject_3  3.778605    8
**************************************************
Now Printing Filtered Data of only : Subject_4
**************************************************
**************************************************

In [17]:
HouseOne.groupby("Subject") #returns the group object

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023308D0D550>

#### Aggregate functions
How do we get multiple statistical inferences from the groups at once?
  Let's try to get the mean, median, mode, and count of the ratings for each group.

In [18]:
HouseOne.groupby("Subject")['Rating'].mean()

Subject
Subject_0    3.275702
Subject_2    6.705909
Subject_3    3.778605
Subject_4    5.791977
Subject_5    8.323960
Name: Rating, dtype: float64

In [19]:
HouseOne.groupby("Subject")['Rating'].agg("mean")  # the string name is preferred over np.mean in recent pandas

Subject
Subject_0    3.275702
Subject_2    6.705909
Subject_3    3.778605
Subject_4    5.791977
Subject_5    8.323960
Name: Rating, dtype: float64

In [22]:
HouseOne.groupby("Subject")["Num"].sum()

Subject
Subject_0     8
Subject_2    13
Subject_3     8
Subject_4    26
Subject_5     5
Name: Num, dtype: int32

In [23]:
HouseOne.groupby("Num")['Num'].count()

Num
2    1
4    1
5    3
6    1
8    3
9    1
Name: Num, dtype: int64

Compare the timings, and check that the values match, between the version where we looped over the groups and the shortcut method used here. If the data is big enough, this time difference becomes significant; try it once by increasing n_studs to 3,000,000.
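
The claim that both approaches give the same values can be checked directly. The sketch below is an assumption-laden stand-in for the notebook's data: it uses a fixed random seed, 1,000 rows, and only the Subject and Rating columns, and it compares results rather than timings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n_studs = 1000
house = pd.DataFrame({
    "Subject": rng.choice([f"Subject_{i}" for i in range(6)], size=n_studs),
    "Rating": rng.uniform(low=1, high=10, size=n_studs),
})

# explicit loop over the groups
loop_means = {name: group["Rating"].mean()
              for name, group in house.groupby("Subject")}

# built-in aggregation
direct_means = house.groupby("Subject")["Rating"].mean()

same = all(np.isclose(loop_means[name], direct_means[name])
           for name in direct_means.index)
print(same)
```

To compare speed, each approach can be placed in its own cell with `%%time`, as done for the loop earlier.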

In [24]:
"""
  There is one rule for aggregate functions: the function you pass must return a single value per group.
  For example:
    for a column, mean returns a single value, which is the average of that column.
    value_counts cannot be used directly because it returns multiple values, but we can pick the most (or least) frequent element from its result and return that; see the getMeMode function.
  You can also pass a custom (user-defined) function. Here we pass a user-defined function that returns the mode of the series/column.

"""

def getMeMode(x):
    return x.value_counts().index[0] # return the value of the most frequent element
# Note: x is the entire Rating column of one particular group, and the function returns a *single* value, the mode.

# Pass a list rather than a set so that the output columns keep the order given here.
stats = HouseOne.groupby("Subject")["Rating"].agg(["median", "mean", "count", "sum", getMeMode]) # getMeMode is passed as a function reference, without parentheses
stats

Unnamed: 0_level_0,median,mean,count,sum,getMeMode
Subject,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Subject_0,3.275702,3.275702,1,3.275702,3.275702
Subject_2,6.705909,6.705909,2,13.411817,9.623728
Subject_3,3.778605,3.778605,1,3.778605,3.778605
Subject_4,5.167409,5.791977,5,28.959885,1.325285
Subject_5,8.32396,8.32396,1,8.32396,8.32396
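
When the output column names matter, pandas also supports named aggregation, where each keyword becomes a column name. This is a sketch on a small hypothetical `scores` table rather than the random `HouseOne` data, so that the results are predictable.

```python
import pandas as pd

scores = pd.DataFrame({
    'Subject': ['Maths', 'Maths', 'Bio', 'Bio', 'Bio'],
    'Rating': [4.0, 8.0, 3.0, 5.0, 7.0],
})

# keyword = output column name, value = aggregation to apply
stats = scores.groupby('Subject')['Rating'].agg(
    mean_rating='mean',
    median_rating='median',
    n_ratings='count',
)
print(stats)
```

The columns appear in the order the keywords are written, which avoids having to reorder them afterwards.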


**Aggregating Multiple Columns**

In [27]:
# the dict maps column name -> aggregation; every key must be an existing column of HouseOne
HouseOne.groupby("Subject").agg({"Rating":"mean", "Num":"median", "Name":getMeMode })

### lambda function

Python lambda functions are anonymous functions, meaning functions without a name. The def keyword is used to define a normal, named function in Python; similarly, the lambda keyword is used to define an anonymous function.


In [28]:

def a_name(x):
    return x+x

# lambda func
lambda x: x+x

<function __main__.<lambda>(x)>

In [30]:
df=pd.DataFrame()

df["Number"]=["ABC123","AZZ0011","XYZ555"]
df

Unnamed: 0,Number
0,ABC123
1,AZZ0011
2,XYZ555


In [31]:

df["Number_first_3_characters"] = df["Number"].apply(lambda x: x[:3])
df

Unnamed: 0,Number,Number_first_3_characters
0,ABC123,ABC
1,AZZ0011,AZZ
2,XYZ555,XYZ
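
For simple string operations like this one, the `.str` accessor offers an alternative to `apply` with a lambda. A minimal sketch showing that both give the same values:

```python
import pandas as pd

df = pd.DataFrame({'Number': ['ABC123', 'AZZ0011', 'XYZ555']})

via_apply = df['Number'].apply(lambda x: x[:3])  # lambda applied to each element
via_str = df['Number'].str[:3]                   # same slice through the .str accessor

print(via_apply.tolist())
print(via_str.tolist())
```

`apply` with a lambda remains the more general tool, since the function body can contain any Python expression.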


## Join & Concat


In [32]:
import pandas as pd

In [33]:
Marks = pd.DataFrame({'name':['Walter','White','Saul','Goodman'],
                     'marks':[70,75,80,90]})



In [35]:
Age = pd.DataFrame({'name':['Walter','Saul','Goodman','Hank'],
                   'age':[21,22,20,24],
                   'Hobby':['Cooking','Reading','Playing','Collecting Minerals']})


## Join

We created two dictionaries and converted them into DataFrames (tables) named Marks and Age, on which we are going to perform all the joins.



In [36]:
Marks

Unnamed: 0,name,marks
0,Walter,70
1,White,75
2,Saul,80
3,Goodman,90


In [37]:
Age

Unnamed: 0,name,age,Hobby
0,Walter,21,Cooking
1,Saul,22,Reading
2,Goodman,20,Playing
3,Hank,24,Collecting Minerals


This is how both tables look initially.

## Left Join

In [38]:
Left_Join = pd.merge(Marks, Age, on='name', how='left')
Left_Join

Unnamed: 0,name,marks,age,Hobby
0,Walter,70,21.0,Cooking
1,White,75,,
2,Saul,80,22.0,Reading
3,Goodman,90,20.0,Playing


Here we perform a simple left join using the tables we just created. Marks is the left table, as we can see in the code, and Age is the right table. The merge function is used: the 'on' parameter sets the key column (the basis of the join), and 'how' specifies the type of join we want to perform. All names from Marks are kept; White has no match in Age, so age and Hobby are missing (NaN) for that row.

## Right Join

In [39]:
Right_Join=pd.merge(Marks, Age, on='name',how='right')
Right_Join

Unnamed: 0,name,marks,age,Hobby
0,Walter,70.0,21,Cooking
1,Saul,80.0,22,Reading
2,Goodman,90.0,20,Playing
3,Hank,,24,Collecting Minerals


Similarly, we can perform a right join by specifying 'how' as 'right'.

All the names (ids) from the right (Age) table have been retained, along with their corresponding marks from the left table. Hank has no entry in Marks, so his marks value is missing (NaN).

## Inner Join

In [40]:
Inner_Join= Marks.merge(Age, on = 'name', how = 'inner')
Inner_Join

Unnamed: 0,name,marks,age,Hobby
0,Walter,70,21,Cooking
1,Saul,80,22,Reading
2,Goodman,90,20,Playing


In the resultant table of the Inner join, we can see the names which are common in both, and their respective values from both tables are returned.

## Outer Join

In [41]:
Outer_Join= Marks.merge(Age,on='name',how='outer')
Outer_Join

Unnamed: 0,name,marks,age,Hobby
0,Walter,70.0,21.0,Cooking
1,White,75.0,,
2,Saul,80.0,22.0,Reading
3,Goodman,90.0,20.0,Playing
4,Hank,,24.0,Collecting Minerals


Similarly, the outer join returns all the rows that appear in either of the tables (Marks and Age), with missing values where one table has no match.
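
To see where each row of a join came from, `merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below rebuilds reduced versions of the two tables (lower-case names `marks` and `age`, without the Hobby column) so that it runs on its own.

```python
import pandas as pd

marks = pd.DataFrame({'name': ['Walter', 'White', 'Saul', 'Goodman'],
                      'marks': [70, 75, 80, 90]})
age = pd.DataFrame({'name': ['Walter', 'Saul', 'Goodman', 'Hank'],
                    'age': [21, 22, 20, 24]})

# _merge is 'both', 'left_only', or 'right_only' for each row
outer = marks.merge(age, on='name', how='outer', indicator=True)
origin = outer.set_index('name')['_merge'].astype(str).to_dict()

print(outer)
```

Rows marked `left_only` are the ones a right or inner join would drop, and rows marked `right_only` are the ones a left or inner join would drop.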

## Concat

Concat is another function that comes with the pandas library; it does a lot of the heavy lifting and makes concatenation of tables look very easy.

In [43]:
Marks1= pd.DataFrame({'name':['Walter','White','Saul','Goodman'],
                     'marks':[70,75,80,90]})
Marks1

Unnamed: 0,name,marks
0,Walter,70
1,White,75
2,Saul,80
3,Goodman,90


In [44]:
Marks2 = pd.DataFrame({'name' : ['Hank','Pinkman','Mike','Fring'],
    'marks' : [88,72,75,90]})
Marks2

Unnamed: 0,name,marks
0,Hank,88
1,Pinkman,72
2,Mike,75
3,Fring,90


We created two tables, Marks1 and Marks2, with 4 records each.

In [45]:
pd.concat([Marks1,Marks2],axis=0)

Unnamed: 0,name,marks
0,Walter,70
1,White,75
2,Saul,80
3,Goodman,90
0,Hank,88
1,Pinkman,72
2,Mike,75
3,Fring,90


In [46]:
pd.concat([Marks1,Marks2],axis=1)

Unnamed: 0,name,marks,name.1,marks.1
0,Walter,70,Hank,88
1,White,75,Pinkman,72
2,Saul,80,Mike,75
3,Goodman,90,Fring,90


Concatenating on axis=0 stacks the tables vertically: Marks2 is placed below Marks1, and each table keeps its original index, which is why the labels 0 to 3 repeat. Concatenating on axis=1 places the tables side by side instead.

In [None]:
Marks1 = pd.DataFrame({'name' : ['Walter','White','Saul','Goodman'],
    'marks' : [70,75,80,90]})


In [47]:
Age = pd.DataFrame({'hobby' : ['Cooking', 'Reading', 'Playing','Collecting Minerals']
                   ,'age' : [10,20,30,40]})

pd.concat([Marks1,Marks2],axis=1)

Unnamed: 0,name,marks,name.1,marks.1
0,Walter,70,Hank,88
1,White,75,Pinkman,72
2,Saul,80,Mike,75
3,Goodman,90,Fring,90


Similarly, when we change the axis, the concatenation becomes horizontal.
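
When tables are stacked vertically, each keeps its own index, so the row labels repeat. If a fresh 0..n-1 index is wanted, `ignore_index=True` can be passed, as in this minimal sketch (lower-case variable names are used to keep it separate from the notebook's tables):

```python
import pandas as pd

marks1 = pd.DataFrame({'name': ['Walter', 'White', 'Saul', 'Goodman'],
                       'marks': [70, 75, 80, 90]})
marks2 = pd.DataFrame({'name': ['Hank', 'Pinkman', 'Mike', 'Fring'],
                       'marks': [88, 72, 75, 90]})

kept_index = pd.concat([marks1, marks2], axis=0)                      # labels 0-3 appear twice
new_index = pd.concat([marks1, marks2], axis=0, ignore_index=True)    # labels 0-7

print(kept_index.index.tolist())
print(new_index.index.tolist())
```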

### Basic Functions

In [49]:
import pandas as pd

In [50]:
data=pd.read_csv("data.csv")

In [51]:
data

Unnamed: 0,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
0,China,1398.72,9596.96,12234.78,Asia,
1,India,1351.16,3287.26,2575.67,Asia,1947-08-15
2,US,329.74,9833.52,19485.39,N.America,1776-07-04
3,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
4,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
5,Pakistan,205.71,881.91,302.14,Asia,1947-08-14


In [53]:
# displaying first 5 rows of dataset
data.head()

Unnamed: 0,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
0,China,1398.72,9596.96,12234.78,Asia,
1,India,1351.16,3287.26,2575.67,Asia,1947-08-15
2,US,329.74,9833.52,19485.39,N.America,1776-07-04
3,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
4,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07


## Unique function

The unique function gives the unique values in the series


In [54]:
data.CONT.unique()

array(['Asia', 'N.America', 'S.America'], dtype=object)

## nunique function

The nunique function gives count of unique values in the series

In [55]:
data.CONT.nunique()

3

## value_counts function
The value_counts function gives the number of times each value occurs in the series

In [56]:
# applying the value_counts function on the CONT column of data

data.CONT.value_counts()

Asia         4
N.America    1
S.America    1
Name: CONT, dtype: int64

## describe function

The describe() method is used for calculating some statistical data like percentile, mean and std of the numerical values of the Series or DataFrame.

In [58]:
data.AREA.describe()

count       6.000000
mean     5671.058333
std      4088.720505
min       881.910000
25%      2255.012500
50%      5901.515000
75%      9326.662500
max      9833.520000
Name: AREA, dtype: float64

## isin function

The isin() function is used to check whether each element in the DataFrame or Series is contained in values or not.

In [59]:
# applying the isin function to check which rows have "US" in the COUNTRY column

data['COUNTRY'].isin(["US"])

0    False
1    False
2     True
3    False
4    False
5    False
Name: COUNTRY, dtype: bool

In [61]:
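# fetch all the rows where CONT is one of the listed values
# note: isin() is case-sensitive, so 'ASIA' does not match 'Asia' and only the S.America row is returned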

data[data['CONT'].isin(['ASIA','S.America'])]

Unnamed: 0,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
4,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
