# Pandas Library

In this section we will learn how to use pandas for data analysis. You can think of pandas as an extremely powerful version of Excel, with a lot more features.

Outline:

- Pandas Info
- Installing pandas Library
- Series objects
- DataFrame



## Pandas Info

**What is Pandas?**

- Pandas is a Python library that provides fast, flexible, and expressive data structures designed to make working with data sets easy and intuitive.

- It has functions for analyzing, cleaning, exploring, and manipulating data.

- The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

**Why Use Pandas?**
- Pandas allows us to analyze big data and make conclusions based on statistical theories.

- Pandas can clean messy data sets, and make them readable and relevant.

- Relevant data is very important in data science.

- Pandas provides easy-to-use data structures and data analysis tools.

- The main data structure is the `DataFrame`, which you can think of as an in-memory 2D table (like a spreadsheet, with column names and row labels).
- For more information, see: https://pandas.pydata.org/docs/getting_started/install.html

**What Can Pandas Do?**

**Pandas gives you answers about the data, such as:**

- Is there a correlation between two or more columns?
- What is the average value?
- What is the maximum value?
- What is the minimum value?
- Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called *cleaning* the data.
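
The questions above map onto a handful of `DataFrame` and `Series` methods. The following is a minimal sketch, assuming a small made-up table; the column names `hours` and `score` and their values are illustrative and do not come from any dataset used in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical data: study hours and test scores, with one missing score.
df = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0],
    "score": [52.0, 58.0, np.nan, 71.0, 80.0],
})

clean = df.dropna()                                  # cleaning: drop rows with missing values
correlation = clean["hours"].corr(clean["score"])    # Pearson correlation between two columns
average = clean["score"].mean()                      # average value
highest = clean["score"].max()                       # max value
lowest = clean["score"].min()                        # min value

print(len(clean), average, highest, lowest)
print(round(correlation, 3))
```

Dropping the incomplete row first is only one possible choice; `mean()`, `max()`, and `min()` skip missing values by default.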

## Installing Pandas Library

In [None]:
# Install pandas first if needed (uncomment the next line in a notebook)
# !pip install pandas

# Import pandas and NumPy under their conventional aliases
import pandas as pd
import numpy as np

## `Series` objects
A `Series` is one of the core pandas data structures:
* A `Series` object is a 1D array, similar to a column in a spreadsheet (with a column name and row labels).
* A pandas `Series` is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).
* A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.


<img src=https://media.geeksforgeeks.org/wp-content/uploads/dataSER-1.png width=600>


In [None]:
import numpy as np
import pandas as pd

# Creating a NumPy array
numpy_array = np.array([10, 20, 30, 40, 50])
print("NumPy array:")
print(numpy_array)

# Creating a Pandas Series
pandas_series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print("\nPandas Series:")
print(pandas_series)


NumPy array:
[10 20 30 40 50]

Pandas Series:
a    10
b    20
c    30
d    40
e    50
dtype: int64
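
Label-based access, the main difference between a Series and a NumPy array, can be sketched as follows. The first Series reuses the data from the example above; the second one (`mixed`, with made-up keys and values) is only there to illustrate that a Series may hold values of different types.

```python
import pandas as pd

# Same data as the example above, with string labels.
s = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

by_label = s["b"]          # look up by index label
by_position = s.iloc[1]    # look up by integer position
print(by_label, by_position)

# A Series built from a dict uses the keys as labels and can mix value types.
mixed = pd.Series({"name": "Ali", "age": 30, "height": 1.75})
print(mixed.dtype)         # object dtype holds arbitrary Python objects
```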


In [None]:
s = pd.Series([2,-1,3,5])
type(s)
s.dtype

dtype('int64')

In [None]:
s

0    2
1   -1
2    3
3    5
dtype: int64

Arithmetic operations on Series are also possible, and they apply elementwise, just like for ndarrays:

In [None]:
s + [1000,2000,3000,4000]

0    1002
1    1999
2    3003
3    4005
dtype: int64

Similar to NumPy, if you add a single number to a Series, that number is added to all items in the Series. This is called *broadcasting*:

In [None]:
s + 1000

0    1002
1     999
2    1003
3    1005
dtype: int64

The same is true for all binary operations such as `*` or `/`, and even conditional operations:

In [None]:
s < 0

0    False
1     True
2    False
3    False
dtype: bool
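
As a short sketch of the other operators mentioned above, using the same Series `s`: multiplication and division are applied item by item, and the boolean Series produced by a comparison can be used to select the matching items.

```python
import pandas as pd

s = pd.Series([2, -1, 3, 5])

doubled = s * 2            # multiplication is applied to every item
halved = s / 2             # true division returns floats
negatives = s[s < 0]       # a boolean Series selects the matching items

print(doubled.tolist())
print(halved.tolist())
print(negatives.tolist())
```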

## Index labels
Each item in a `Series` object has an identifier called the *index label*. By default, it is simply the position of the item in the `Series` (starting at `0`), but you can also set the index labels manually:

In [None]:
# Series from a NumPy array with custom integer labels and a name
data = np.array(['Ali','Mahmoud','Essa','Sameer','Samai'])
ser = pd.Series(data, index=[5, 10, 15, 20, 25], name="N")
print(ser)

# Series from a list with the default index (0, 1, 2, ...)
data = ['Ali','Mahmoud','Essa','Sameer','Samai']
ser = pd.Series(data)
print(ser)

# Same list, with the index built explicitly from np.arange
ser = pd.Series(data, index=np.arange(0, 5))
print(ser)
print(type(ser))

5         Ali
10    Mahmoud
15       Essa
20     Sameer
25      Samai
Name: N, dtype: object
0        Ali
1    Mahmoud
2       Essa
3     Sameer
4      Samai
dtype: object
0        Ali
1    Mahmoud
2       Essa
3     Sameer
4      Samai
dtype: object
<class 'pandas.core.series.Series'>
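
When the labels are themselves integers, as in the first Series above, label and position can be confused. A minimal sketch of keeping the two apart with `.loc` (labels) and `.iloc` (positions); the helper variable `label_zero_found` is only there for illustration.

```python
import pandas as pd

ser = pd.Series(["Ali", "Mahmoud", "Essa", "Sameer", "Samai"],
                index=[5, 10, 15, 20, 25], name="N")

first_by_position = ser.iloc[0]   # position 0
item_by_label = ser.loc[10]       # label 10

# Label 0 does not exist in this index, so a label lookup raises KeyError.
try:
    ser.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(first_by_position, item_by_label, label_zero_found)
```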


# DataFrame

## Creating DataFrame
- A pandas DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It is generally the most commonly used pandas object.
- A pandas DataFrame can be created in multiple ways. Let’s discuss the different ways to create a DataFrame one by one.
- For more information, see this video: https://www.youtube.com/watch?v=dEHJmn6p39M&t=93s
<img src=https://media.geeksforgeeks.org/wp-content/cdn-uploads/creating_dataframe1.png width=800>


### Method #1: Creating Pandas DataFrame from lists of lists.

In [None]:
# Import pandas library
import pandas as pd

# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])

# print dataframe.
df


Unnamed: 0,Name,Age
0,tom,10
1,nick,15
2,juli,14


### Method #2: Creating DataFrame from dict of ndarrays/lists
To create a DataFrame from a dict of ndarrays or lists, all the arrays must be of the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to `range(n)`, where `n` is the array length.

In [None]:
# Create a DataFrame from a dict of lists,
# using the default integer index.
import pandas as pd
# Each key becomes a column; each list holds that column's values.
data = {'Name':['Tom', 'nick', 'krish', 'jack'],
        'Age':[20, 21, 19, 18]}
# Create the DataFrame
df = pd.DataFrame(data)
# Display the result
df

Unnamed: 0,Name,Age
0,Tom,20
1,nick,21
2,krish,19
3,jack,18
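
The length rules stated for Method #2 can be checked directly. This is a minimal sketch reusing the data above; the index labels `r1` to `r3` and the flag variables are illustrative names.

```python
import pandas as pd

data = {"Name": ["Tom", "nick", "krish", "jack"],
        "Age": [20, 21, 19, 18]}

# No index passed: pandas assigns 0, 1, 2, 3.
default_df = pd.DataFrame(data)
print(list(default_df.index))

# An index with the wrong length is rejected.
try:
    pd.DataFrame(data, index=["r1", "r2", "r3"])   # 3 labels for 4 rows
    mismatch_rejected = False
except ValueError as err:
    mismatch_rejected = True
    print("ValueError:", err)

# Columns of unequal length are rejected too.
try:
    pd.DataFrame({"Name": ["Tom", "nick"], "Age": [20]})
    unequal_rejected = False
except ValueError:
    unequal_rejected = True
```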


### Method #3: Creating an indexed DataFrame using arrays.

In [None]:
# Create a pandas DataFrame from a dict of lists
# with custom row labels passed through index=.
import pandas as pd

# initialize data of lists.
data = {'Name':['Tom', 'Jack', 'nick', 'juli'],
        'marks':[99, 98, 95, 90, ]}

# Creates pandas DataFrame.
df = pd.DataFrame(data, index =['rank1',
                                'rank2',
                                'rank3',
                                'rank4',
                                ])

# print the data
df


Unnamed: 0,Name,marks
rank1,Tom,99
rank2,Jack,98
rank3,nick,95
rank4,juli,90


### Method #4: Creating DataFrame from list of dicts
A pandas DataFrame can be created by passing a list of dictionaries as input data. By default, the dictionary keys are used as column names.

In [None]:
# Create a pandas DataFrame from a list of dicts.
import pandas as pd

# Each dict is one row; the second dict has no "d" key.
data = [{'a': 1, 'b': 2, 'c':3, "d" :10},
        {'a':10, 'b': 20, 'c': 30}]

# Creates DataFrame.
df = pd.DataFrame(data,index=['rank1',
                                'rank2'
                                ])
# Print the data
df


Unnamed: 0,a,b,c,d
rank1,1,2,3,10.0
rank2,10,20,30,
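
The empty cell for `d` in `rank2` is a missing value (`NaN`). A brief sketch of inspecting it, and of limiting the columns with the `columns=` argument; the variable names `records`, `missing_d`, `d_dtype`, and `subset` are illustrative.

```python
import pandas as pd

records = [{"a": 1, "b": 2, "c": 3, "d": 10},
           {"a": 10, "b": 20, "c": 30}]      # no "d" key in the second record

df = pd.DataFrame(records, index=["rank1", "rank2"])

missing_d = df["d"].isna().tolist()          # which rows lack a value for "d"
d_dtype = str(df["d"].dtype)                 # NaN forces the column to float64

# Passing columns= selects (and orders) the keys to keep.
subset = pd.DataFrame(records, columns=["c", "a"])
print(missing_d, d_dtype, list(subset.columns))
```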


### Method #5: Creating DataFrame from Series
You can create a DataFrame by passing a dictionary of `Series` objects:

In [None]:
people_dict = {
    "weight": pd.Series([68, 83, 112], index=["alice", "bob", "charles"]),
    "birthyear": pd.Series([1984, 1985, 1992], index=["bob", "alice", "charles"]),
    "children": pd.Series([0, 3], index=["charles", "bob"]),
    "hobby": pd.Series(["Biking", "Dancing"], index=["alice", "bob"]),
}
people = pd.DataFrame(people_dict)
people

Unnamed: 0,weight,birthyear,children,hobby
alice,68,1985,,Biking
bob,83,1984,3.0,Dancing
charles,112,1992,0.0,


In [None]:
dict_1={"car":["city","ignis","800","Verna","Venue","Punto"],"brand":["Honda","maruti","Maruti","Hyundia","Hyundai","Fiat"],
         "cost":[900000,600000,100000,800000,950000,750000],"year":[5,7,10,4,2,6]}

In [None]:
auto=pd.DataFrame(dict_1)
auto

Unnamed: 0,car,brand,cost,year
0,city,Honda,900000,5
1,ignis,maruti,600000,7
2,800,Maruti,100000,10
3,Verna,Hyundia,800000,4
4,Venue,Hyundai,950000,2
5,Punto,Fiat,750000,6


In [None]:
auto_reset_index = auto.reset_index(drop=True)


In [None]:
auto.to_csv('automobile.csv')
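
By default, `to_csv` also writes the row index, which comes back as an extra column named `Unnamed: 0` when the file is read again. The sketch below shows this on a two-row table; it writes to an in-memory string rather than to `automobile.csv`, so it leaves no file behind.

```python
import io
import pandas as pd

auto = pd.DataFrame({"car": ["city", "ignis"], "cost": [900000, 600000]})

# Default: the row index is written as an unnamed first column.
with_index = auto.to_csv()
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))      # includes 'Unnamed: 0'

# index=False leaves the row labels out of the output.
without_index = auto.to_csv(index=False)
clean = pd.read_csv(io.StringIO(without_index))
print(list(clean.columns))
```

Whether to keep the index depends on whether the row labels carry meaning; for a default `0..n-1` index, `index=False` is usually the simpler choice.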

### Dealing with DataFrame

In [None]:
import pandas as pd

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
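# read_csv already returns a DataFrame; it is kept as `Read` and reused later
Read=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/bike_rentals.csv")
# Wrap it again under the name `data` for the cells below
data=pd.DataFrame(Read)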

In [None]:
data.head(2)

Unnamed: 0,instant,dayname,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,Saturday,1,0.0,1.0,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,Sunday,1,0.0,1.0,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     731 non-null    int64  
 1   dayname     731 non-null    object 
 2   season      731 non-null    int64  
 3   yr          730 non-null    float64
 4   mnth        730 non-null    float64
 5   holiday     731 non-null    int64  
 6   weekday     731 non-null    int64  
 7   workingday  731 non-null    int64  
 8   weathersit  731 non-null    int64  
 9   temp        730 non-null    float64
 10  atemp       730 non-null    float64
 11  hum         728 non-null    float64
 12  windspeed   726 non-null    float64
 13  casual      731 non-null    int64  
 14  registered  731 non-null    int64  
 15  cnt         731 non-null    int64  
dtypes: float64(6), int64(9), object(1)
memory usage: 91.5+ KB


In [None]:
data.shape

(731, 16)

In [None]:
data.columns

Index(['instant', 'dayname', 'season', 'yr', 'mnth', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed',
       'casual', 'registered', 'cnt'],
      dtype='object')

In [None]:
data.describe()

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,731.0,731.0,730.0,730.0,731.0,731.0,731.0,731.0,730.0,730.0,728.0,726.0,731.0,731.0,731.0
mean,366.0,2.49658,0.5,6.512329,0.028728,2.997264,0.682627,1.395349,0.495587,0.474512,0.627987,0.190476,848.176471,3656.172367,4504.348837
std,211.165812,1.110807,0.500343,3.448303,0.167155,2.004787,0.465773,0.544894,0.183094,0.163017,0.142331,0.077725,686.622488,1560.256377,1937.211452
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.05913,0.07907,0.0,0.022392,2.0,20.0,22.0
25%,183.5,2.0,0.0,4.0,0.0,1.0,0.0,1.0,0.336875,0.337794,0.521562,0.134494,315.5,2497.0,3152.0
50%,366.0,3.0,0.5,7.0,0.0,3.0,1.0,1.0,0.499167,0.487364,0.627083,0.180971,713.0,3662.0,4548.0
75%,548.5,3.0,1.0,9.75,0.0,5.0,1.0,2.0,0.655625,0.608916,0.730104,0.233218,1096.0,4776.5,5956.0
max,731.0,4.0,1.0,12.0,1.0,6.0,1.0,3.0,0.861667,0.840896,0.9725,0.507463,3410.0,6946.0,8714.0


In [None]:
data.head(1)

Unnamed: 0,instant,dayname,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,Saturday,1,0.0,1.0,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985


In [None]:
data['season'].value_counts()


3    188
2    184
1    181
4    178
Name: season, dtype: int64

In [None]:
data["dayname"].unique()

array(['Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday',
       'Friday'], dtype=object)

In [None]:
data.isnull().sum()

instant       0
dayname       0
season        0
yr            1
mnth          1
holiday       0
weekday       0
workingday    0
weathersit    0
temp          1
atemp         1
hum           3
windspeed     5
casual        0
registered    0
cnt           0
dtype: int64

### Accessing Columns

In [None]:
data['dayname']

0       Saturday
1         Sunday
2         Monday
3        Tuesday
4      Wednesday
         ...    
726     Thursday
727       Friday
728     Saturday
729       Sunday
730       Monday
Name: dayname, Length: 731, dtype: object

In [None]:
data[['weathersit', 'temp']]

Unnamed: 0,weathersit,temp
0,2,0.344167
1,2,0.363478
2,1,0.196364
3,1,0.200000
4,1,0.226957
...,...,...
726,2,0.254167
727,2,0.253333
728,2,0.253333
729,1,0.255833


### Accessing rows
Let's go back to the `people` `DataFrame`:

In [None]:
people = pd.DataFrame({
    "birthyear": {"alice":1985, "bob": 1984, "charles": 1992},
    "hobby": {"alice":"Biking", "bob": "Dancing"},
    "weight": {"alice":68, "bob": 83, "charles": 112},
    "children": {"bob": 3, "charles": 0}
})
people

Unnamed: 0,birthyear,hobby,weight,children
alice,1985,Biking,68,
bob,1984,Dancing,83,3.0
charles,1992,,112,0.0


The loc attribute lets you access rows instead of columns. The result is a Series object in which the DataFrame's column names are mapped to row index labels:

In [None]:
people.loc["charles"]

birthyear    1992
hobby         NaN
weight        112
children      0.0
Name: charles, dtype: object

You can also access rows by integer location using the iloc attribute:

In [None]:
people.iloc[2]

birthyear    1992
hobby         NaN
weight        112
children      0.0
Name: charles, dtype: object

You can also get a slice of rows, and this returns a DataFrame object:

In [None]:
people.iloc[1:3]

Unnamed: 0,birthyear,hobby,weight,children
bob,1984,Dancing,83,3.0
charles,1992,,112,0.0
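
`loc` and `iloc` also accept a column selector, and their slices behave differently at the end point. A minimal sketch on a two-column version of `people` (the reduced table is an assumption made to keep the example short):

```python
import pandas as pd

people = pd.DataFrame({
    "birthyear": {"alice": 1985, "bob": 1984, "charles": 1992},
    "weight": {"alice": 68, "bob": 83, "charles": 112},
})

bob_weight = people.loc["bob", "weight"]       # one cell: row label, column label
same_cell = people.iloc[1, 1]                  # same cell by integer positions

label_slice = people.loc["alice":"bob"]        # label slices include the end label
position_slice = people.iloc[0:1]              # position slices exclude the end

print(bob_weight, same_cell, len(label_slice), len(position_slice))
```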


Finally, you can pass a boolean array to get the matching rows:

In [None]:
people[np.array([True, False, True])]

Unnamed: 0,birthyear,hobby,weight,children
alice,1985,Biking,68,
charles,1992,,112,0.0


This is most useful when combined with boolean expressions:

In [None]:
people["birthyear"] < 1990

alice       True
bob         True
charles    False
Name: birthyear, dtype: bool

In [None]:
people[people["birthyear"] < 1990]

Unnamed: 0,birthyear,hobby,weight,children
alice,1985,Biking,68,
bob,1984,Dancing,83,3.0
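
Several conditions can be combined into one mask. The sketch below uses a reduced `people` table and illustrative thresholds; each condition is wrapped in parentheses because `&`, `|`, and `~` bind more tightly than comparison operators.

```python
import pandas as pd

people = pd.DataFrame({
    "birthyear": {"alice": 1985, "bob": 1984, "charles": 1992},
    "weight": {"alice": 68, "bob": 83, "charles": 112},
})

# & means "and", | means "or", ~ means "not".
before_1990_and_heavy = people[(people["birthyear"] < 1990) & (people["weight"] > 80)]
young_or_light = people[(people["birthyear"] >= 1990) | (people["weight"] < 70)]
not_before_1990 = people[~(people["birthyear"] < 1990)]

print(list(before_1990_and_heavy.index))
print(list(young_or_light.index))
print(list(not_before_1990.index))
```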


### Adding and removing columns
You can generally treat DataFrame objects like dictionaries of Series, so the following work fine:

In [None]:
people

Unnamed: 0,birthyear,hobby,weight,children
alice,1985,Biking,68,
bob,1984,Dancing,83,3.0
charles,1992,,112,0.0


In [None]:
people["age"] = 2024 - people["birthyear"]  # adds a new column "age"
# people["age"]
people["over 30"] = people["age"] > 30      # adds another column "over 30"
birthyears = people.pop("birthyear")
del people["children"]

people

Unnamed: 0,hobby,weight,age,over 30
alice,Biking,68,39,True
bob,Dancing,83,40,True
charles,,112,32,True


In [None]:
birthyears

alice      1985
bob        1984
charles    1992
Name: birthyear, dtype: int64

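When you add a new column from a plain list, it must have the same number of rows as the DataFrame. When you add it from a `Series`, pandas aligns the values on the index labels instead: rows missing from the Series are filled with NaN, and extra labels are ignored: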

In [None]:
people["pets"] = pd.Series({"bob": 0, "charles": 5, "eugene":1})  # alice is missing, eugene is ignored
people

Unnamed: 0,hobby,weight,age,over 30,pets
alice,Biking,68,39,True,
bob,Dancing,83,40,True,0.0
charles,,112,32,True,5.0


When adding a new column, it is added at the end (on the right) by default. You can also insert a column anywhere else using the insert() method:

In [None]:
people.insert(1, "height",[172, 181, 185])
people

Unnamed: 0,hobby,height,weight,age,over 30,pets
alice,Biking,172,68,39,True,
bob,Dancing,181,83,40,True,0.0
charles,,185,112,32,True,5.0
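
Assigning with `people["..."] = ...` and calling `insert()` both modify the DataFrame in place. As an alternative sketch, `assign()` returns a new DataFrame and leaves the original untouched; the column name `body_mass_index` and the reduced table are illustrative.

```python
import pandas as pd

people = pd.DataFrame({
    "height": {"alice": 172, "bob": 181, "charles": 185},
    "weight": {"alice": 68, "bob": 83, "charles": 112},
})

# assign() returns a new DataFrame; `people` itself is left unchanged.
with_bmi = people.assign(
    body_mass_index=people["weight"] / (people["height"] / 100) ** 2
)

print(list(people.columns))
print(list(with_bmi.columns))
```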


### Changing index

In [None]:
data=data.set_index("season")

In [None]:
data.head(10)

season,instant,dayname,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,1,Saturday,0.0,1.0,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,Sunday,0.0,1.0,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
1,3,Monday,0.0,1.0,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
1,4,Tuesday,0.0,1.0,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
1,5,Wednesday,0.0,1.0,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600
1,6,Thursday,0.0,1.0,0,4,1,1,0.204348,0.233209,0.518261,0.089565,88,1518,1606
1,7,Friday,0.0,1.0,0,5,1,2,0.196522,0.208839,0.498696,0.168726,148,1362,1510
1,8,Saturday,0.0,1.0,0,6,0,2,0.165,0.162254,0.535833,0.266804,68,891,959
1,9,Sunday,0.0,1.0,0,0,0,1,0.138333,0.116175,0.434167,0.36195,54,768,822
1,10,Monday,0.0,1.0,0,1,1,1,0.150833,0.150888,0.482917,0.223267,41,1280,1321


In [None]:
data.reset_index(inplace = True)

In [None]:
data.head()

Unnamed: 0,season,instant,dayname,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,1,Saturday,0.0,1.0,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,1,2,Sunday,0.0,1.0,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,1,3,Monday,0.0,1.0,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,1,4,Tuesday,0.0,1.0,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,1,5,Wednesday,0.0,1.0,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600
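
The effect of `set_index` and `reset_index` is easier to see on a tiny table. This sketch assumes a three-row frame with only the `season` and `cnt` columns:

```python
import pandas as pd

df = pd.DataFrame({"season": [1, 1, 2], "cnt": [985, 801, 1349]})

indexed = df.set_index("season")       # 'season' moves from the columns to the index
season_one = indexed.loc[1]            # label lookup returns every row labelled 1
restored = indexed.reset_index()       # 'season' becomes an ordinary column again

print(list(indexed.columns), len(season_one), list(restored.columns))
```

Because index labels may repeat, `indexed.loc[1]` returns a DataFrame with two rows here rather than a single row.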


### Filtering of data

In [None]:
print(type(data))


<class 'pandas.core.frame.DataFrame'>


In [None]:
data['dayname']=="Saturday"

0       True
1      False
2      False
3      False
4      False
       ...  
726    False
727    False
728     True
729    False
730    False
Name: dayname, Length: 731, dtype: bool

In [None]:
new_data=data["dayname"]=="Saturday"

In [None]:
data.loc[new_data]

Unnamed: 0,season,instant,dayname,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,1,Saturday,0.0,1.0,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
7,1,8,Saturday,0.0,1.0,0,6,0,2,0.165000,0.162254,0.535833,0.266804,68,891,959
14,1,15,Saturday,0.0,1.0,0,6,0,2,0.233333,0.248112,0.498750,0.157963,222,1026,1248
21,1,22,Saturday,0.0,1.0,0,6,0,1,0.059130,0.079070,0.400000,0.171970,93,888,981
28,1,29,Saturday,0.0,1.0,0,6,0,1,0.196522,0.212126,0.651739,0.145365,123,975,1098
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
700,4,701,Saturday,1.0,12.0,0,6,0,2,0.298333,0.316904,0.806667,0.059704,951,4240,5191
707,4,708,Saturday,1.0,12.0,0,6,0,2,0.381667,0.389508,0.911250,0.101379,1153,4429,5582
714,4,715,Saturday,1.0,12.0,0,6,0,1,0.324167,0.338383,0.650417,0.106350,767,4280,5047
721,1,722,Saturday,1.0,12.0,0,6,0,1,0.265833,0.236113,0.441250,0.407346,205,1544,1749


In [None]:
auto

Unnamed: 0,car,brand,cost,year
0,city,Honda,900000,5
1,ignis,maruti,600000,7
2,800,Maruti,100000,10
3,Verna,Hyundia,800000,4
4,Venue,Hyundai,950000,2
5,Punto,Fiat,750000,6


In [None]:
isin_1 = auto['car'].isin(['800','city'])
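
A mask built with `isin` is applied like any other boolean mask. A minimal sketch on a two-column copy of `auto` (the names `selected` and `others` are illustrative):

```python
import pandas as pd

auto = pd.DataFrame({
    "car": ["city", "ignis", "800", "Verna", "Venue", "Punto"],
    "brand": ["Honda", "maruti", "Maruti", "Hyundia", "Hyundai", "Fiat"],
})

isin_1 = auto["car"].isin(["800", "city"])   # True where 'car' is one of the listed values
selected = auto.loc[isin_1]                  # rows whose car is in the list
others = auto.loc[~isin_1]                   # every other row

print(selected.index.tolist(), len(others))
```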

***Problems***

In [None]:
start=auto['brand'].str.startswith('H')
auto.loc[start]

Unnamed: 0,car,brand,cost,year
0,city,Honda,900000,5
3,Verna,Hyundia,800000,4
4,Venue,Hyundai,950000,2


In [None]:
ends = auto['brand'].str.endswith('i')
auto.loc[ends]

Unnamed: 0,car,brand,cost,year
1,ignis,maruti,600000,7
2,800,Maruti,100000,10
4,Venue,Hyundai,950000,2


In [None]:
contains = ~auto['brand'].str.contains('o')
auto.loc[contains]

Unnamed: 0,car,brand,cost,year
1,ignis,maruti,600000,7
2,800,Maruti,100000,10
3,Verna,Hyundia,800000,4
4,Venue,Hyundai,950000,2
5,Punto,Fiat,750000,6


In [None]:
auto.loc[~contains]  # use ~ (not unary minus) to invert a boolean mask

Unnamed: 0,car,brand,cost,year
0,city,Honda,900000,5


In [None]:
isna = auto['car'].isna()
auto.loc[isna]

Unnamed: 0,car,brand,cost,year


In [None]:
notna = auto['car'].notna()
auto.loc[notna]

Unnamed: 0,car,brand,cost,year
0,city,Honda,900000,5
1,ignis,maruti,600000,7
2,800,Maruti,100000,10
3,Verna,Hyundia,800000,4
4,Venue,Hyundai,950000,2
5,Punto,Fiat,750000,6


### Sorting of Data

In [None]:
auto.sort_values("cost")

Unnamed: 0,car,brand,cost,year
2,800,Maruti,100000,10
1,ignis,maruti,600000,7
5,Punto,Fiat,750000,6
3,Verna,Hyundia,800000,4
0,city,Honda,900000,5
4,Venue,Hyundai,950000,2


In [None]:
data.sort_values(['dayname', 'season'],
                       ascending=[True, False])

Unnamed: 0,season,instant,dayname,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
265,4,266,Friday,0.0,9.0,0,5,1,2,0.609167,0.522125,0.972500,0.078367,258,2137,2395
272,4,273,Friday,0.0,9.0,0,5,1,1,0.564167,0.544829,0.647500,0.206475,830,4372,5202
279,4,280,Friday,0.0,10.0,0,5,1,1,0.510833,0.504404,0.684167,0.022392,949,4036,4985
286,4,287,Friday,0.0,10.0,0,5,1,2,0.550833,0.529675,0.716250,0.223883,529,3115,3644
293,4,294,Friday,0.0,10.0,0,5,1,1,0.427500,0.423596,0.574167,0.221396,676,3628,4304
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
417,1,418,Wednesday,1.0,2.0,0,3,1,1,0.395833,0.392667,0.567917,0.234471,394,4379,4773
424,1,425,Wednesday,1.0,2.0,0,3,1,2,0.344348,0.348470,0.804783,0.179117,65,1769,1834
431,1,432,Wednesday,1.0,3.0,0,3,1,1,0.404167,0.385100,0.513333,0.345779,432,4484,4916
438,1,439,Wednesday,1.0,3.0,0,3,1,1,0.572500,0.548617,0.507083,0.115062,997,5315,6312
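
The multi-key sort above is easier to follow on a few rows. A sketch with made-up values for `dayname` and `season`:

```python
import pandas as pd

df = pd.DataFrame({
    "dayname": ["Sunday", "Friday", "Friday", "Sunday"],
    "season": [1, 4, 1, 3],
})

# Primary key 'dayname' ascending, ties broken by 'season' descending.
ordered = df.sort_values(["dayname", "season"], ascending=[True, False])
print(ordered.index.tolist())

# sort_values returns a new DataFrame; sort_index puts the rows back in label order.
restored = ordered.sort_index()
print(restored.index.tolist())
```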


### Updating Columns and Rows

*Changing the name of columns*

In [None]:
auto.columns=[x.capitalize() for x in auto.columns]

In [None]:
auto

Unnamed: 0,Car,Brand,Cost,Year
0,city,Honda,900000,5
1,ignis,maruti,600000,7
2,800,Maruti,100000,10
3,Verna,Hyundia,800000,4
4,Venue,Hyundai,950000,2
5,Punto,Fiat,750000,6


In [None]:
auto.columns=[x.lower() for x in auto.columns]

In [None]:
auto

Unnamed: 0,car,brand,cost,year
0,city,Honda,900000,5
1,ignis,maruti,600000,7
2,800,Maruti,100000,10
3,Verna,Hyundia,800000,4
4,Venue,Hyundai,950000,2
5,Punto,Fiat,750000,6


In [None]:
auto.rename(columns={"brand":"company","cost":"price"},inplace=True)

In [None]:
auto

Unnamed: 0,car,company,price,year
0,city,Honda,900000,5
1,ignis,maruti,600000,7
2,800,Maruti,100000,10
3,Verna,Hyundia,800000,4
4,Venue,Hyundai,950000,2
5,Punto,Fiat,750000,6


*Updating Columns*

In [None]:
auto["price"]=auto["price"].replace({600000:100000})

In [None]:
auto

Unnamed: 0,car,company,price,year
0,city,Honda,900000,5
1,ignis,maruti,100000,7
2,800,Maruti,100000,10
3,Verna,Hyundia,800000,4
4,Venue,Hyundai,950000,2
5,Punto,Fiat,750000,6


In [None]:
auto["car"].apply(str.upper)

0     CITY
1    IGNIS
2      800
3    VERNA
4    VENUE
5    PUNTO
Name: car, dtype: object

In [None]:
def price_change(price):
    return price/10

In [None]:
auto["price"].apply(price_change)

0    90000.0
1    10000.0
2    10000.0
3    80000.0
4    95000.0
5    75000.0
Name: price, dtype: float64

In [None]:
auto["price"].apply(lambda x:x*10)
# lambda defines a small anonymous (unnamed) function

0    9000000
1    1000000
2    1000000
3    8000000
4    9500000
5    7500000
Name: price, dtype: int64

In [None]:
auto["car"].replace({"city":"city zx","ignis":"swift"})

0    city zx
1      swift
2        800
3      Verna
4      Venue
5      Punto
Name: car, dtype: object
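
For simple arithmetic and string changes, `apply` is not required: the same results can be written as vectorized expressions. A sketch on a three-row copy of the `car` and `price` columns:

```python
import pandas as pd

auto = pd.DataFrame({"car": ["city", "ignis", "800"],
                     "price": [900000, 100000, 100000]})

via_apply = auto["price"].apply(lambda x: x / 10)
vectorized = auto["price"] / 10                 # same values, no Python-level function call per item

upper_apply = auto["car"].apply(str.upper)
upper_str = auto["car"].str.upper()             # .str accessor equivalent

print(vectorized.tolist())
print(upper_str.tolist())
```

Like the calls above, these expressions return new Series; `auto` itself changes only if the result is assigned back to a column.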

*Updating Rows*

In [None]:
auto.loc[2,["car","price"]]=["dezire",600000]

In [None]:
auto

Unnamed: 0,car,company,price,year
0,city,Honda,900000,5
1,ignis,maruti,100000,7
2,dezire,Maruti,600000,10
3,Verna,Hyundia,800000,4
4,Venue,Hyundai,950000,2
5,Punto,Fiat,750000,6


*Adding and removing a column*

In [None]:
fuel_type=["Diesel","Petrol","Petrol","Diesel","Petrol","Diesel"]

In [None]:
auto["Fuel Types"]=fuel_type

In [None]:
auto

Unnamed: 0,car,company,price,year,Fuel Types
0,city,Honda,900000,5,Diesel
1,ignis,maruti,100000,7,Petrol
2,dezire,Maruti,600000,10,Petrol
3,Verna,Hyundia,800000,4,Diesel
4,Venue,Hyundai,950000,2,Petrol
5,Punto,Fiat,750000,6,Diesel


In [None]:
auto.drop("Fuel Types",axis=1,inplace=True)

*Adding and Removing Rows*

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
new_row=pd.DataFrame([{"car":"Tiago","company":"TATA"}])
auto=pd.concat([auto,new_row],ignore_index=True)


In [None]:
auto.drop(index=6)

Unnamed: 0,car,company,price,year
0,city,Honda,900000.0,5.0
1,ignis,maruti,100000.0,7.0
2,dezire,Maruti,600000.0,10.0
3,Verna,Hyundia,800000.0,4.0
4,Venue,Hyundai,950000.0,2.0
5,Punto,Fiat,750000.0,6.0


### Grouping And Aggregation

In [None]:
data=pd.DataFrame(Read)


In [None]:
groups = data.groupby(['season'])

In [None]:
groups.mean(numeric_only=True)  # skip the non-numeric 'dayname' column


season,instant,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,262.685083,0.5,3.044444,0.038674,3.0,0.657459,1.40884,0.297748,0.296914,0.581498,0.214692,334.928177,2269.20442,2604.132597
2,308.5,0.5,4.652174,0.021739,2.98913,0.695652,1.402174,0.544405,0.520307,0.627701,0.203428,1106.097826,3886.233696,4992.331522
3,401.5,0.5,7.691489,0.021277,3.031915,0.696809,1.297872,0.706309,0.655898,0.634243,0.172095,1202.611702,4441.691489,5644.303191
4,493.0,0.5,10.696629,0.033708,2.966292,0.679775,1.477528,0.423332,0.415857,0.668719,0.172127,729.11236,3999.050562,4728.162921


In [None]:
groups.size()

season
1    181
2    184
3    188
4    178
dtype: int64

In [None]:
groups.count()

season,instant,dayname,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,181,181,180,180,181,181,181,181,181,181,180,180,181,181,181
2,184,184,184,184,184,184,184,184,184,184,183,181,184,184,184
3,188,188,188,188,188,188,188,188,188,188,187,188,188,188,188
4,178,178,178,178,178,178,178,178,177,177,178,177,178,178,178


### Cleaning of Data

In [None]:
import numpy as np

dict_1={"car":["city","ignis",np.nan,"Verna","Venue","Punto",np.nan],"brand":["Honda","maruti","Maruti","Hyundia",np.nan,"Fiat",np.nan],
         "cost":[900000,600000,np.nan,800000,950000,750000,np.nan,],"year":[5,7,10,np.nan,np.nan,6,np.nan]}


auto=pd.DataFrame(dict_1)
auto

Unnamed: 0,car,brand,cost,year
0,city,Honda,900000.0,5.0
1,ignis,maruti,600000.0,7.0
2,,Maruti,,10.0
3,Verna,Hyundia,800000.0,
4,Venue,,950000.0,
5,Punto,Fiat,750000.0,6.0
6,,,,


In [None]:
auto.isnull().sum()

car      2
brand    2
cost     2
year     3
dtype: int64

In [None]:
auto.dropna(axis="index",how='all',subset=["cost"])

Unnamed: 0,car,brand,cost,year
0,city,Honda,900000.0,5.0
1,ignis,maruti,600000.0,7.0
3,Verna,Hyundia,800000.0,
4,Venue,,950000.0,
5,Punto,Fiat,750000.0,6.0


In [None]:
auto["cost"]=auto["cost"].fillna(0)  # assign back instead of using inplace on a selected column

In [None]:
auto

Unnamed: 0,car,brand,cost,year
0,city,Honda,900000.0,5.0
1,ignis,maruti,600000.0,7.0
2,,Maruti,0.0,10.0
3,Verna,Hyundia,800000.0,
4,Venue,,950000.0,
5,Punto,Fiat,750000.0,6.0
6,,,0.0,
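
The `how` argument of `dropna` and the fill value of `fillna` are the two main choices when cleaning. The sketch below uses a four-row table in the same style as `auto`; filling with the column mean is shown only as one alternative to `0`, and whether it is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

auto = pd.DataFrame({
    "car": ["city", np.nan, "Verna", np.nan],
    "cost": [900000, np.nan, 800000, np.nan],
    "year": [5, 10, np.nan, np.nan],
})

no_empty_rows = auto.dropna(how="all")        # drop rows where every value is missing
complete_rows = auto.dropna(how="any")        # drop rows with at least one missing value

# Fill missing costs with the column mean instead of 0.
filled = auto.assign(cost=auto["cost"].fillna(auto["cost"].mean()))

print(len(no_empty_rows), len(complete_rows))
print(filled["cost"].tolist())
```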


### The `concat()` function in pandas

In [None]:
# importing the module
import pandas as pd

# creating the DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
					'B': ['B0', 'B1', 'B2', 'B3']})
display('df1:', df1)
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
					'B': ['B4', 'B5', 'B6', 'B7']})
display('df2:', df2)

# concatenating
print('After concatenating:')
display(pd.concat([df1, df2],
				keys = ['key1', 'key2']))


'df1:'

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3


'df2:'

Unnamed: 0,A,B
0,A4,B4
1,A5,B5
2,A6,B6
3,A7,B7


After concatenating:


Unnamed: 0,Unnamed: 1,A,B
key1,0,A0,B0
key1,1,A1,B1
key1,2,A2,B2
key1,3,A3,B3
key2,0,A4,B4
key2,1,A5,B5
key2,2,A6,B6
key2,3,A7,B7


In [None]:
# importing the module
import pandas as pd

# creating the DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
					'B': ['B0', 'B1', 'B2', 'B3']})
display('df1:', df1)
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
					'D': ['D0', 'D1', 'D2', 'D3']})
display('df2:', df2)

# concatenating
display('After concatenating:')
display(pd.concat([df1, df2],
				axis = 1))


'df1:'

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3


'df2:'

Unnamed: 0,C,D
0,C0,D0
1,C1,D1
2,C2,D2
3,C3,D3


'After concatenating:'

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [None]:
import pandas as pd

# creating the DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
					'B': ['B0', 'B1', 'B2', 'B3']})
display('df1:', df1)
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
					'B': ['B4', 'B5', 'B6', 'B7']})
display('df2:', df2)

# concatenating
display('After concatenating:')
display(pd.concat([df1, df2],
				ignore_index = True))


'df1:'

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3


'df2:'

Unnamed: 0,A,B
0,A4,B4
1,A5,B5
2,A6,B6
3,A7,B7


'After concatenating:'

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3
4,A4,B4
5,A5,B5
6,A6,B6
7,A7,B7


In [None]:
# importing the module
import pandas as pd

# creating the DataFrame
df = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
					'B': ['B0', 'B1', 'B2', 'B3']})
display('df:', df)
# creating the Series
series = pd.Series([1, 2, 3, 4])
display('series:', series)

# concatenating
display('After concatenating:')
display(pd.concat([df, series],
				axis = 1))


'df:'

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3


'series:'

0    1
1    2
2    3
3    4
dtype: int64

'After concatenating:'

Unnamed: 0,A,B,0
0,A0,B0,1
1,A1,B1,2
2,A2,B2,3
3,A3,B3,4
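
The examples above concatenate frames whose columns either match exactly or do not overlap at all. When the columns overlap only partly, the `join` argument decides what is kept. A sketch with two small frames whose contents are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
df2 = pd.DataFrame({"B": ["B2", "B3"], "C": ["C2", "C3"]})

# Default join="outer": union of the columns, gaps are filled with NaN.
outer = pd.concat([df1, df2], ignore_index=True)

# join="inner": only the columns present in every frame are kept.
inner = pd.concat([df1, df2], join="inner", ignore_index=True)

print(list(outer.columns), list(inner.columns))
```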


In [None]:
import pandas as pd
import timeit

# Method 1: Using iterrows()
def method_with_iterrows(df):
    for index, row in df.iterrows():
        df.at[index, 'new_column'] = row['old_column'].upper()
    return df

# Method 2: Without iterrows() (using vectorized operations)
def method_without_iterrows(df):
    df['new_column'] = df['old_column'].str.upper()
    return df

# Generate sample data
data = {'old_column': ['apple', 'banana', 'orange'] * 10**5}
df = pd.DataFrame(data)

# Measure time for method with iterrows()
time_with_iterrows = timeit.timeit("method_with_iterrows(df.copy())", globals=globals(), number=1)

# Measure time for method without iterrows()
time_without_iterrows = timeit.timeit("method_without_iterrows(df.copy())", globals=globals(), number=1)

# Display results
print(f"Method with iterrows() execution time: {time_with_iterrows:.5f} seconds")
print(f"Method without iterrows() execution time: {time_without_iterrows:.5f} seconds")


Method with iterrows() execution time: 21.70851 seconds
Method without iterrows() execution time: 0.08651 seconds
