
In [None]:
!pip install pandas



# What is Pandas?
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

In [1]:
import pandas as pd

In [2]:
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

myvar

Unnamed: 0,cars,passings
0,BMW,3
1,Volvo,7
2,Ford,2


# Checking Pandas Version
The version string is stored in the ```__version__``` attribute.

In [None]:
print(pd.__version__)

2.2.2


# What is a Series?
A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

In [3]:
a = [1, 7, 2]

myvar = pd.Series(a)
myvar

Unnamed: 0,0
0,1
1,7
2,2


# Labels
If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

In [4]:
print(myvar[0])

1


# Create Labels
With the ```index``` argument, you can name your own labels.

In [5]:
myvar = pd.Series(a, index = ["x", "y", "z"])

myvar

Unnamed: 0,0
x,1
y,7
z,2


When you have created labels, you can access an item by referring to the label.

In [6]:
print(myvar["y"])

7
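Once a Series has custom labels, it can help to be explicit about whether a lookup is by label or by position. A minimal sketch, assuming the same values and labels as above: ```loc``` selects by label and ```iloc``` selects by integer position.

```python
import pandas as pd

# Same values and labels as in the example above
s = pd.Series([1, 7, 2], index=["x", "y", "z"])

by_label = s.loc["y"]      # look up by label
by_position = s.iloc[1]    # look up by integer position (0-based)

print(by_label, by_position)
```

Both lookups return the same item here, because "y" is the second label.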


# Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when creating a Series.
# Example
Create a simple Pandas Series from a dictionary

In [7]:
calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

myvar

Unnamed: 0,0
day1,420
day2,380
day3,390



**Note:** The keys of the dictionary become the labels.

To select only some of the items in the dictionary, use the ```index``` argument and specify only the items you want to include in the Series.

# Example
Create a Series using only data from "day1" and "day2":

In [8]:
calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index = ["day1", "day2"])

myvar

Unnamed: 0,0
day1,420
day2,380
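The ```index``` argument can also name a label that is not a key in the dictionary. A small sketch of what happens in that case, where "day4" is a made-up label used only for illustration: the missing entry is filled with NaN.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is a hypothetical label that is not a key in the dictionary
s = pd.Series(calories, index=["day1", "day4"])

print(s)
print(s.isna())   # True marks the label that had no matching key
```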


# DataFrames
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

A Series is like a column; a DataFrame is the whole table.
# Example
Create a DataFrame from a dictionary of two lists, where each list becomes a column:

In [9]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

myvar = pd.DataFrame(data)

myvar

Unnamed: 0,calories,duration
0,420,50
1,380,40
2,390,45
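A DataFrame can also be assembled from Series objects, which matches the idea that each column is a Series. A minimal sketch, where the day labels are assumptions added for illustration:

```python
import pandas as pd

# Two Series that share the same labels
calories = pd.Series([420, 380, 390], index=["day1", "day2", "day3"])
duration = pd.Series([50, 40, 45], index=["day1", "day2", "day3"])

# Each dictionary key becomes a column name; rows are aligned on the labels
frame = pd.DataFrame({"calories": calories, "duration": duration})

print(frame)
```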


# What is a DataFrame?
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array, or a table with rows and columns.

In [10]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

df

Unnamed: 0,calories,duration
0,420,50
1,380,40
2,390,45


# Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the ```loc``` attribute to return one or more specified rows.

In [11]:
#refer to the row index:
df.loc[0]

Unnamed: 0,0
calories,420
duration,50


**Note:** This example returns a Pandas **Series**.

In [12]:
#use a list of indexes:
df.loc[[0, 1]]

Unnamed: 0,calories,duration
0,420,50
1,380,40


**Note:** When you pass a list of labels inside the ```[]``` (as in ```df.loc[[0, 1]]```), the result is a Pandas **DataFrame**.
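The difference between the two return types can be checked directly. A short sketch, reusing the calories/duration data from above: a single label gives a Series, while a list gives a DataFrame even if the list holds only one label.

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

one_row = df.loc[0]        # single label -> Series
same_row = df.loc[[0]]     # list with one label -> DataFrame with one row

print(type(one_row).__name__)
print(type(same_row).__name__)
```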

In [13]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
df

Unnamed: 0,calories,duration
day1,420,50
day2,380,40
day3,390,45


# Locate Named Indexes
Use the named index in the ```loc``` attribute to return the specified row(s).
# Example
Return "day2":

In [14]:
#refer to the named index:
df.loc["day2"]

Unnamed: 0,day2
calories,380
duration,40
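Named indexes also work with slices and with a row/column pair. A minimal sketch using the same data as above; note that, unlike ordinary Python slices, a label slice with ```loc``` includes both endpoints.

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

# Label slices with loc include both endpoints
first_two = df.loc["day1":"day2"]

# A row label and a column label together select a single cell
cell = df.loc["day2", "calories"]

print(first_two)
print(cell)
```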


# Load Files Into a DataFrame
If your data sets are stored in a file, Pandas can load them into a DataFrame.
# Example
Load a comma separated file (CSV file) into a DataFrame:

In [15]:
df = pd.read_csv('/content/sample_data/california_housing_test.csv')

df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-122.05,37.37,27.0,3885.0,661.0,1537.0,606.0,6.6085,344700.0
1,-118.30,34.26,43.0,1510.0,310.0,809.0,277.0,3.5990,176500.0
2,-117.81,33.78,27.0,3589.0,507.0,1484.0,495.0,5.7934,270500.0
3,-118.36,33.82,28.0,67.0,15.0,49.0,11.0,6.1359,330000.0
4,-119.67,36.33,19.0,1241.0,244.0,850.0,237.0,2.9375,81700.0
...,...,...,...,...,...,...,...,...,...
2995,-119.86,34.42,23.0,1450.0,642.0,1258.0,607.0,1.1790,225000.0
2996,-118.14,34.06,27.0,5257.0,1082.0,3496.0,1036.0,3.3906,237200.0
2997,-119.70,36.30,10.0,956.0,201.0,693.0,220.0,2.2895,62000.0
2998,-117.12,34.10,40.0,96.0,14.0,46.0,14.0,3.2708,162500.0


# Read CSV Files
A simple way to store big data sets is to use CSV files (comma separated files).

CSV files contain plain text and are a well-known format that can be read by almost any tool, including Pandas.

In [16]:
df = pd.read_csv('https://www.w3schools.com/python/pandas/data.csv')
print(df.to_string())

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.0
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

**Tip:** use ```to_string()``` to print the entire DataFrame.


If you have a large DataFrame with many rows, Pandas will only return the first 5 rows, and the last 5 rows:
# Example
Print the DataFrame without the ```to_string()``` method:

In [17]:
df = pd.read_csv('https://www.w3schools.com/python/pandas/data.csv')
df

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4


# max_rows
The number of rows returned is defined in Pandas option settings.
You can check your system's maximum rows with the ```pd.options.display.max_rows``` statement.

In [18]:
print(pd.options.display.max_rows)

60



On my system the number is 60, which means that if the DataFrame contains more than 60 rows, the ```print(df)``` statement will return only the headers and the first and last 5 rows.

You can change the maximum number of rows with the same statement.
# Example
Set the maximum number of rows. The value must be at least the number of rows in the DataFrame (169 here) to display it in full; with the value 60 used below, the output is still truncated:

In [19]:
pd.options.display.max_rows = 60
df

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4
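If you only want the full display for one statement, the setting can be changed temporarily instead of globally. A minimal sketch, assuming a made-up 100-row DataFrame: ```pd.option_context``` applies an option inside a ```with``` block and restores the previous value afterwards, and ```None``` means no row limit.

```python
import pandas as pd

df = pd.DataFrame({"n": range(100)})   # 100 rows, more than the default limit of 60

default_limit = pd.get_option("display.max_rows")

# Inside the with-block, None removes the row limit for display
with pd.option_context("display.max_rows", None):
    full_text = repr(df)

# Outside the block the earlier setting applies again
truncated_text = repr(df)
restored_limit = pd.get_option("display.max_rows")

print(len(full_text.splitlines()), len(truncated_text.splitlines()))
```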


# Read JSON
Big data sets are often stored or extracted as JSON.

JSON is plain text, but has the format of an object, and is well known in the world of programming, including Pandas.

In [20]:
df = pd.read_json('/content/sample_data/anscombe.json')

print(df.to_string())

   Series   X      Y
0       I  10   8.04
1       I   8   6.95
2       I  13   7.58
3       I   9   8.81
4       I  11   8.33
5       I  14   9.96
6       I   6   7.24
7       I   4   4.26
8       I  12  10.84
9       I   7   4.81
10      I   5   5.68
11     II  10   9.14
12     II   8   8.14
13     II  13   8.74
14     II   9   8.77
15     II  11   9.26
16     II  14   8.10
17     II   6   6.13
18     II   4   3.10
19     II  12   9.13
20     II   7   7.26
21     II   5   4.74
22    III  10   7.46
23    III   8   6.77
24    III  13  12.74
25    III   9   7.11
26    III  11   7.81
27    III  14   8.84
28    III   6   6.08
29    III   4   5.39
30    III  12   8.15
31    III   7   6.42
32    III   5   5.73
33     IV   8   6.58
34     IV   8   5.76
35     IV   8   7.71
36     IV   8   8.84
37     IV   8   8.47
38     IV   8   7.04
39     IV   8   5.25
40     IV  19  12.50
41     IV   8   5.56
42     IV   8   7.91
43     IV   8   6.89



**Tip:** use ```to_string()``` to print the entire DataFrame.
# Dictionary as JSON
**JSON = Python Dictionary**

JSON objects have a format very similar to Python dictionaries.
If your JSON data is not in a file, but already in a Python dictionary, you can load it into a DataFrame directly:
# Example
Load a Python Dictionary into a DataFrame:

In [21]:
data = {
  "Duration":{
    "0":60,
    "1":60,
    "2":60,
    "3":45,
    "4":45,
    "5":60
  },
  "Pulse":{
    "0":110,
    "1":117,
    "2":103,
    "3":109,
    "4":117,
    "5":102
  },
  "Maxpulse":{
    "0":130,
    "1":145,
    "2":135,
    "3":175,
    "4":148,
    "5":127
  },
  "Calories":{
    "0":409,
    "1":479,
    "2":340,
    "3":282,
    "4":406,
    "5":300
  }
}
df=pd.DataFrame(data)
df

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409
1,60,117,145,479
2,60,103,135,340
3,45,109,175,282
4,45,117,148,406
5,60,102,127,300
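When the JSON arrives as text (for example from a web response) rather than as a dictionary, it can be converted first. A small sketch, assuming a made-up JSON string: the standard library's ```json.loads()``` produces the dictionary, which is then passed to ```pd.DataFrame()```.

```python
import json
import pandas as pd

# Hypothetical JSON text, e.g. the body of an API response
json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 117}}'

# json.loads turns the JSON text into a Python dictionary
data = json.loads(json_text)

df = pd.DataFrame(data)

# The row labels stay strings ("0", "1") because JSON object keys are strings
print(df)
print(df.index.tolist())
```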


# Viewing the Data
One of the most used methods for getting a quick overview of the DataFrame is the ```head()``` method.

The ```head()``` method returns the headers and a specified number of rows, starting from the top.

In [22]:
df = pd.read_csv('https://www.w3schools.com/python/pandas/data.csv')
df.head(10)

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
5,60,102,127,300.0
6,60,110,136,374.0
7,45,104,134,253.3
8,30,109,133,195.1
9,60,98,124,269.0


**Note:** If the number of rows is not specified, the ```head()``` method returns the top 5 rows.

In [23]:
df.head()

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0



There is also a ```tail()``` method for viewing the last rows of the DataFrame.

The ```tail()``` method returns the headers and a specified number of rows, starting from the bottom.
# Example
Print the last 5 rows of the DataFrame:

In [24]:
df.tail()

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4
168,75,125,150,330.4
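Both methods take the number of rows as an argument, and a negative number reverses the meaning: ```head(-n)``` returns every row except the last n. A short sketch on a made-up five-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

top_two = df.head(2)          # first 2 rows
bottom_two = df.tail(2)       # last 2 rows
all_but_last = df.head(-1)    # negative n: every row except the last one

print(top_two)
print(bottom_two)
print(all_but_last)
```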


# Info About the Data
The DataFrame object has a method called ```info()``` that gives you more information about the data set.
# Example
Print information about the data:

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
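The Non-Null Count column above shows 164 values for "Calories" out of 169 rows, so 5 cells are empty. To get the number of missing values per column directly, one option is to combine ```isna()``` with ```sum()```. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values in "Calories"
df = pd.DataFrame({
    "Duration": [60, 45, 60, 45],
    "Calories": [409.1, np.nan, 340.0, np.nan],
})

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = df.isna().sum()

print(missing_per_column)
```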


# Data Cleaning
Data cleaning means fixing bad data in your data set.

Bad data could be:

* Empty cells
* Data in wrong format
* Wrong data
* Duplicates

# Empty Cells
Empty cells can potentially give you a wrong result when you analyze data.
# Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.

This is usually OK, since data sets can be very big, and removing a few rows will not have a big impact on the result.

In [26]:
new_df = df.dropna()

new_df

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4


**Note:** By default, the ```dropna()``` method returns a new DataFrame, and will not change the original.

If you want to change the original DataFrame, use the ```inplace = True``` argument:

In [27]:
df.dropna(inplace = True)

df

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4



**Note:** With ```inplace = True```, ```dropna()``` does NOT return a new DataFrame; it removes all rows containing NULL values from the original DataFrame.
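Sometimes only the empty cells in certain columns matter. A small sketch on made-up data, using the ```subset``` argument of ```dropna()``` to look at one column only; it also shows that the original DataFrame is unchanged when ```inplace``` is not used.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pulse": [110, np.nan, 103],
    "Calories": [409.1, 479.0, np.nan],
})

cleaned = df.dropna()                            # drops rows 1 and 2
only_calories = df.dropna(subset=["Calories"])   # drops only row 2

print(len(df), len(cleaned), len(only_calories))
```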
# Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead.

This way you do not have to delete entire rows just because of some empty cells.

The ```fillna()``` method allows us to replace empty cells with a value:
# Example
Replace NULL values with the number 130:

In [28]:
df.fillna(130, inplace = True)
df

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4


# Replace Only For Specified Columns
The example above replaces all empty cells in the whole DataFrame.

To only replace empty values for one column, specify the column name for the DataFrame:
# Example
Replace NULL values in the "Calories" column with the number 130:

In [29]:

df["Calories"]=df["Calories"].fillna(130)
df

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4


# Replace Using Mean, Median, or Mode
A common way to replace empty cells is to calculate the mean, median, or mode value of the column.
Pandas uses the ```mean()```, ```median()```, and ```mode()``` methods to calculate the respective values for a specified column:
# Example
Calculate the MEAN:

In [30]:
x = df["Calories"].mean()
x

375.79024390243904


**Mean** = the average value (the sum of all values divided by number of values).
# Example
Calculate the MEDIAN:

In [31]:
x = df["Calories"].median()
x

318.6


**Median** = the value in the middle, after you have sorted all values ascending.
# Example
Calculate the MODE:

In [32]:
x = df["Calories"].mode()
x

Unnamed: 0,Calories
0,300.0


**Mode** = the value that appears most frequently.
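To actually replace the empty cells, the calculated value is passed to ```fillna()```. A minimal sketch on a made-up "Calories" column; note that ```mode()``` returns a Series (there can be several most-frequent values), so the first entry is taken with ```[0]```.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
df = pd.DataFrame({"Calories": [300.0, 300.0, 340.0, np.nan, 400.0]})

mean_value = df["Calories"].mean()       # NaN is skipped by default
median_value = df["Calories"].median()
mode_value = df["Calories"].mode()[0]    # mode() returns a Series; take the first entry

# Replace the empty cell with the median (mean_value or mode_value work the same way)
filled = df["Calories"].fillna(median_value)

print(mean_value, median_value, mode_value)
print(filled)
```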

In [34]:
df=pd.read_csv("weather_data.csv",parse_dates=['day'])
df

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-04,,9.0,Sunny
2,2017-01-05,28.0,,Snow
3,2017-01-06,,7.0,
4,2017-01-07,32.0,,Rain
5,2017-01-08,,,Sunny
6,2017-01-09,,,
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


```df.fillna(value)```

Replace all NA/null data with value.

In [35]:
df.fillna('NA')

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-04,,9.0,Sunny
2,2017-01-05,28.0,,Snow
3,2017-01-06,,7.0,
4,2017-01-07,32.0,,Rain
5,2017-01-08,,,Sunny
6,2017-01-09,,,
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


```df.dropna()```

Drop rows with any column having NA/null data.

In [36]:
x=df.dropna()
x

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


```df.describe()```

Basic descriptive statistics for each column (or GroupBy).

In [37]:
x.describe()

Unnamed: 0,day,temperature,windspeed
count,3,3.0,3.0
mean,2017-01-07 08:00:00,35.333333,8.666667
min,2017-01-01 00:00:00,32.0,6.0
25%,2017-01-05 12:00:00,33.0,7.0
50%,2017-01-10 00:00:00,34.0,8.0
75%,2017-01-10 12:00:00,37.0,10.0
max,2017-01-11 00:00:00,40.0,12.0
std,,4.163332,3.05505


```set_index()```

Set the Index of the DataFrame

In [38]:
df.set_index('day',inplace=True)
df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,9.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,,,Sunny
2017-01-09,,,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
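Once a column is the index, its values can be used as row labels with ```loc```, and ```reset_index()``` moves it back into a regular column. A short sketch on a few made-up rows in the same layout as the weather data; without ```inplace=True```, ```set_index()``` returns a new DataFrame.

```python
import pandas as pd

# A few hypothetical rows in the same layout as the weather data above
df = pd.DataFrame({
    "day": pd.to_datetime(["2017-01-01", "2017-01-04", "2017-01-05"]),
    "temperature": [32.0, None, 28.0],
})

indexed = df.set_index("day")          # returns a new DataFrame; df is unchanged
row = indexed.loc[pd.Timestamp("2017-01-05")]

restored = indexed.reset_index()       # move "day" back into a regular column

print(row)
print(restored.columns.tolist())
```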


In [39]:
x1=df.fillna(0)
x1

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,0.0,9.0,Sunny
2017-01-05,28.0,0.0,Snow
2017-01-06,0.0,7.0,0
2017-01-07,32.0,0.0,Rain
2017-01-08,0.0,0.0,Sunny
2017-01-09,0.0,0.0,0
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


Fill NA values per column with a dictionary that maps column names to replacement values:

In [40]:
new_df = df.fillna({
        'temperature': 0,
        'windspeed': 0,
        'event': 'No Event'
    })
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,0.0,9.0,Sunny
2017-01-05,28.0,0.0,Snow
2017-01-06,0.0,7.0,No Event
2017-01-07,32.0,0.0,Rain
2017-01-08,0.0,0.0,Sunny
2017-01-09,0.0,0.0,No Event
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


Use ```ffill()``` or ```bfill()``` to fill NA values from neighboring rows.

Forward fill: fill NA/NaN values by propagating the last valid observation forward to the next valid one.

In [41]:
new_df = df.ffill()
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,32.0,9.0,Sunny
2017-01-05,28.0,9.0,Snow
2017-01-06,28.0,7.0,Snow
2017-01-07,32.0,7.0,Rain
2017-01-08,32.0,7.0,Sunny
2017-01-09,32.0,7.0,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


Fill NA/NaN values by using the next valid observation to fill the gap.

In [42]:
new_df = df.bfill()
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,28.0,9.0,Sunny
2017-01-05,28.0,7.0,Snow
2017-01-06,32.0,7.0,Rain
2017-01-07,32.0,8.0,Rain
2017-01-08,34.0,8.0,Sunny
2017-01-09,34.0,8.0,Cloudy
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
