# Pandas

Provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language

An open-source Python library for high-performance data manipulation and analysis, built around its powerful data structures

The name 'pandas' is derived from 'panel data', an econometrics term for multidimensional data

### Dataframe

A two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)
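
As a minimal sketch of that definition, the snippet below builds a small dataframe by hand. The column values and the row labels `a`, `b`, `c` are made up for illustration and are not taken from `Toyota.csv`.

```python
import pandas as pd

# Hypothetical toy data: each column may hold a different type (heterogeneous)
toy = pd.DataFrame(
    {
        "Price": [13500, 13750, 13950],
        "FuelType": ["Diesel", "Diesel", "Petrol"],
        "Age": [23.0, 23.0, 24.0],
    },
    index=["a", "b", "c"],  # labeled row axis
)

# Size-mutable: adding a column changes the shape of the same object
toy["Doors"] = [3, 3, 5]

print(toy)
print(toy.shape)  # (3, 4)
```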

In [1]:
import os # to change pwd
import pandas as pd # to work with dataframes
import numpy as np # to perform numeric operations

os.chdir('D:\\Downloads\\Datasets\\NPTEL')

In [2]:
cars_data = pd.read_csv('Toyota.csv', 
                        index_col=0, 
                        na_values=['??', '###'])
cars_data

Unnamed: 0,Price,Age,KM,FuelType,HP,MetColor,Automatic,CC,Doors,Weight
0,13500,23.0,46986.0,Diesel,90,1.0,0,2000,three,1165
1,13750,23.0,72937.0,Diesel,90,1.0,0,2000,3,1165
2,13950,24.0,41711.0,Diesel,90,,0,2000,3,1165
3,14950,26.0,48000.0,Diesel,90,0.0,0,2000,3,1165
4,13750,30.0,38500.0,Diesel,90,0.0,0,2000,3,1170
...,...,...,...,...,...,...,...,...,...,...
1431,7500,,20544.0,Petrol,86,1.0,0,1300,3,1025
1432,10845,72.0,,Petrol,86,0.0,0,1300,3,1015
1433,8500,,17016.0,Petrol,86,0.0,0,1300,3,1015
1434,7250,70.0,,,86,1.0,0,1300,3,1015


### Shallow copy

It creates a new object that
shares the data of
the original object

Without Copy-on-Write, changes made to the data
of a shallow copy are reflected in
the original object as well

In [3]:
sample_cars_data = cars_data.copy(deep = False)

# plain assignment does not copy at all: it only binds
# a second name to the same DataFrame object
sample_cars_data = cars_data

### Deep copy

A deep copy creates a new object
with its own copy of the data and index,
with no reference to the original

Any changes made to the copy
will not be reflected in
the original object

In [4]:
sample_cars_data1 = cars_data.copy(deep = True)
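
The difference between an alias and a deep copy can be checked on a small example. This is a sketch on a hypothetical two-row frame. It leaves out the shallow-copy case on purpose, because whether edits to a shallow copy propagate depends on the pandas version and on whether Copy-on-Write is enabled.

```python
import pandas as pd

# Hypothetical toy frame standing in for cars_data
original = pd.DataFrame({"Price": [13500, 13750], "KM": [46986.0, 72937.0]})

alias = original                  # same object under a second name
deep = original.copy(deep=True)   # independent data and index

deep.loc[0, "Price"] = 0          # modify only the deep copy

print(alias is original)          # True
print(original.loc[0, "Price"])   # 13500, the original is untouched
```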

## Attributes of a dataframe

### 1. Index: 
to get the index (row labels) of the dataframe

In [5]:
cars_data.index

Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
            ...
            1426, 1427, 1428, 1429, 1430, 1431, 1432, 1433, 1434, 1435],
           dtype='int64', length=1436)

### 2. columns:
to get the column labels of the dataframe

In [6]:
cars_data.columns

Index(['Price', 'Age', 'KM', 'FuelType', 'HP', 'MetColor', 'Automatic', 'CC',
       'Doors', 'Weight'],
      dtype='object')

### 3. size:
to get the total number of elements in the dataframe

In [7]:
cars_data.size

14360

### 4. shape:
to get the dimensionality (number of rows, number of columns) of the dataframe

In [8]:
cars_data.shape

(1436, 10)

### 5. Memory usage:
to get the memory usage of each column in bytes

In [9]:
cars_data.memory_usage()

Index        11488
Price        11488
Age          11488
KM           11488
FuelType      5744
HP            5744
MetColor     11488
Automatic    11488
CC           11488
Doors         5744
Weight       11488
dtype: int64

### 6. ndim:
The number of axes / array dimensions 

In [10]:
cars_data.ndim

2
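
The attributes above can be tried together on a small frame. This is a sketch with made-up values, not the Toyota data. Note that `size` is always rows times columns.

```python
import pandas as pd

# Hypothetical toy frame: 3 rows, 2 columns
toy = pd.DataFrame({"Price": [13500, 13750, 13950], "Doors": [3, 3, 5]})

print(toy.index)           # row labels
print(toy.columns)         # column labels
print(toy.shape)           # (3, 2)
print(toy.size)            # 6, i.e. 3 rows * 2 columns
print(toy.ndim)            # 2
print(toy.memory_usage())  # bytes per column, plus the index
```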

## Indexing and selecting data

The Python indexing operator ‘[ ]’ and the attribute / dot operator ‘.’ are used for indexing

They provide quick and easy access to pandas data structures

### 1. head([n]):

Returns the first n rows from the dataframe

By default head() returns first 5 rows

In [11]:
cars_data.head(6)

Unnamed: 0,Price,Age,KM,FuelType,HP,MetColor,Automatic,CC,Doors,Weight
0,13500,23.0,46986.0,Diesel,90,1.0,0,2000,three,1165
1,13750,23.0,72937.0,Diesel,90,1.0,0,2000,3,1165
2,13950,24.0,41711.0,Diesel,90,,0,2000,3,1165
3,14950,26.0,48000.0,Diesel,90,0.0,0,2000,3,1165
4,13750,30.0,38500.0,Diesel,90,0.0,0,2000,3,1170
5,12950,32.0,61000.0,Diesel,90,0.0,0,2000,3,1170


In [12]:
cars_data.head()

Unnamed: 0,Price,Age,KM,FuelType,HP,MetColor,Automatic,CC,Doors,Weight
0,13500,23.0,46986.0,Diesel,90,1.0,0,2000,three,1165
1,13750,23.0,72937.0,Diesel,90,1.0,0,2000,3,1165
2,13950,24.0,41711.0,Diesel,90,,0,2000,3,1165
3,14950,26.0,48000.0,Diesel,90,0.0,0,2000,3,1165
4,13750,30.0,38500.0,Diesel,90,0.0,0,2000,3,1170


### 2. tail([n]):

Returns the last n rows from the dataframe

By default tail() returns last 5 rows

In [13]:
cars_data.tail(6)

Unnamed: 0,Price,Age,KM,FuelType,HP,MetColor,Automatic,CC,Doors,Weight
1430,8450,80.0,23000.0,Petrol,86,0.0,0,1300,3,1015
1431,7500,,20544.0,Petrol,86,1.0,0,1300,3,1025
1432,10845,72.0,,Petrol,86,0.0,0,1300,3,1015
1433,8500,,17016.0,Petrol,86,0.0,0,1300,3,1015
1434,7250,70.0,,,86,1.0,0,1300,3,1015
1435,6950,76.0,1.0,Petrol,110,0.0,0,1600,5,1114


### Accessing scalar data

To access a scalar value, the fastest way is to use the 'at' and 'iat' accessors

'at' provides label-based scalar lookups

'iat' provides integer position-based scalar lookups

In [14]:
cars_data.at[4, 'FuelType']

'Diesel'

In [15]:
cars_data.iat[4,3]

'Diesel'

To access a group of rows and columns by label(s) .loc[] can be used

In [16]:
cars_data.loc[[3,4,6],['FuelType', 'KM']]

Unnamed: 0,FuelType,KM
3,Diesel,48000.0
4,Diesel,38500.0
6,Diesel,


## Data types

The way information is stored in a dataframe or
a Python object affects the analysis and the outputs of
calculations

There are two main types of data:
numeric and character types

Numeric data types include integers and floats
For example: integer 10, float 10.53

Strings are stored with the 'object' dtype in pandas, which can
hold values that contain numbers and / or
characters
For example: 'category1'

![image.png](attachment:image.png)

pandas and base Python use different names for data types.

‘64’ refers to the number of bits allocated to store each value, which
determines the range of values that each “cell” can hold.
64 bits is equivalent to 8 bytes

Allocating space ahead of time allows computers to optimize storage and
processing efficiency

### Checking the datatypes of each column

dtypes returns a Series with the data type of
each column

In [17]:
cars_data.dtypes

Price          int64
Age          float64
KM           float64
FuelType      object
HP            object
MetColor     float64
Automatic      int64
CC             int64
Doors         object
Weight         int64
dtype: object

### Count of unique data types

get_dtype_counts() returns counts of
unique data types in the dataframe
(it was deprecated in pandas 0.25 and removed in pandas 1.0;
dtypes.value_counts() gives the same counts)

In [18]:
cars_data.get_dtype_counts()


float64    3
int64      4
object     3
dtype: int64
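
On pandas versions where `get_dtype_counts()` is no longer available, the same tally can be obtained by counting the values of `dtypes`. A minimal sketch on a hypothetical numeric frame:

```python
import pandas as pd

# Hypothetical toy frame: one integer column, two float columns
toy = pd.DataFrame(
    {
        "Price": [13500, 13750],
        "Age": [23.0, 24.0],
        "KM": [46986.0, 72937.0],
    }
)

# dtypes is a Series, so value_counts() tallies how often each dtype occurs
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)
```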

### Selecting data based on data types

pandas.DataFrame.select_dtypes() returns a
subset of the dataframe's columns based on the column
dtypes

Syntax: DataFrame.select_dtypes(include=None, exclude=None)

In [19]:
cars_data.select_dtypes(include=['int64'])

Unnamed: 0,Price,Automatic,CC,Weight
0,13500,0,2000,1165
1,13750,0,2000,1165
2,13950,0,2000,1165
3,14950,0,2000,1165
4,13750,0,2000,1170
...,...,...,...,...
1431,7500,0,1300,1025
1432,10845,0,1300,1015
1433,8500,0,1300,1015
1434,7250,0,1300,1015


In [20]:
cars_data.select_dtypes(exclude=['int64', 'float64'])

Unnamed: 0,FuelType,HP,Doors
0,Diesel,90,three
1,Diesel,90,3
2,Diesel,90,3
3,Diesel,90,3
4,Diesel,90,3
...,...,...,...
1431,Petrol,86,3
1432,Petrol,86,3
1433,Petrol,86,3
1434,,86,3
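
Besides exact dtype names such as `'int64'`, `select_dtypes` also accepts the generic selector `'number'`, which matches all numeric dtypes. A sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical toy frame with integer, float and string columns
toy = pd.DataFrame(
    {"Price": [13500, 13750], "Age": [23.0, 24.0], "FuelType": ["Diesel", "Petrol"]}
)

numeric_cols = toy.select_dtypes(include="number")  # int and float columns
other_cols = toy.select_dtypes(exclude="number")    # everything else

print(list(numeric_cols.columns))
print(list(other_cols.columns))
```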


### Concise summary of dataframe

info() prints a concise summary of a
dataframe:

- data type of index
- data type of columns
- count of non-null values
- memory usage

Syntax: DataFrame.info()

In [21]:
cars_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1436 entries, 0 to 1435
Data columns (total 10 columns):
Price        1436 non-null int64
Age          1336 non-null float64
KM           1421 non-null float64
FuelType     1336 non-null object
HP           1436 non-null object
MetColor     1286 non-null float64
Automatic    1436 non-null int64
CC           1436 non-null int64
Doors        1436 non-null object
Weight       1436 non-null int64
dtypes: float64(3), int64(4), object(3)
memory usage: 138.6+ KB


In [22]:
np.unique(cars_data['Doors'])

array(['2', '3', '4', '5', 'five', 'four', 'three'], dtype=object)

In [86]:
cars_data = pd.read_csv('Toyota.csv',index_col=0, na_values=["??","????"])
cars_data

Unnamed: 0,Price,Age,KM,FuelType,HP,MetColor,Automatic,CC,Doors,Weight
0,13500,23.0,46986.0,Diesel,90.0,1.0,0,2000,three,1165
1,13750,23.0,72937.0,Diesel,90.0,1.0,0,2000,3,1165
2,13950,24.0,41711.0,Diesel,90.0,,0,2000,3,1165
3,14950,26.0,48000.0,Diesel,90.0,0.0,0,2000,3,1165
4,13750,30.0,38500.0,Diesel,90.0,0.0,0,2000,3,1170
...,...,...,...,...,...,...,...,...,...,...
1431,7500,,20544.0,Petrol,86.0,1.0,0,1300,3,1025
1432,10845,72.0,,Petrol,86.0,0.0,0,1300,3,1015
1433,8500,,17016.0,Petrol,86.0,0.0,0,1300,3,1015
1434,7250,70.0,,,86.0,1.0,0,1300,3,1015


In [87]:
cars_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1436 entries, 0 to 1435
Data columns (total 10 columns):
Price        1436 non-null int64
Age          1336 non-null float64
KM           1421 non-null float64
FuelType     1336 non-null object
HP           1430 non-null float64
MetColor     1286 non-null float64
Automatic    1436 non-null int64
CC           1436 non-null int64
Doors        1436 non-null object
Weight       1436 non-null int64
dtypes: float64(4), int64(4), object(2)
memory usage: 112.2+ KB
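
The effect of `na_values` on the inferred dtype can be reproduced without the CSV file. The sketch below reads a small made-up CSV string through `io.StringIO`. The placeholder strings `??` and `????` match the ones passed in the cell above, but the rows themselves are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV text with '??' and '????' standing in for missing values
csv_text = "Price,HP,KM\n13500,90,46986\n13750,????,??\n13950,86,41711\n"

# Without na_values the placeholders stay as text, so HP and KM are not numeric
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values the placeholders become NaN and the columns parse as floats
clean = pd.read_csv(io.StringIO(csv_text), na_values=["??", "????"])

print(raw.dtypes)
print(clean.dtypes)
```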


## Converting variable’s data types

The astype() method is used to explicitly convert
data from one type to another
Syntax: DataFrame.astype(dtype)

In [88]:
cars_data['MetColor'] = cars_data['MetColor'].astype('object')
cars_data['Automatic'] = cars_data['Automatic'].astype('object')

In [89]:
cars_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1436 entries, 0 to 1435
Data columns (total 10 columns):
Price        1436 non-null int64
Age          1336 non-null float64
KM           1421 non-null float64
FuelType     1336 non-null object
HP           1430 non-null float64
MetColor     1286 non-null object
Automatic    1436 non-null object
CC           1436 non-null int64
Doors        1436 non-null object
Weight       1436 non-null int64
dtypes: float64(3), int64(3), object(4)
memory usage: 101.0+ KB


### nbytes

nbytes is an attribute that gives the total bytes
consumed by the elements of a column

Syntax: Series.nbytes

In [90]:
cars_data['FuelType'].nbytes

5744

In [91]:
cars_data['FuelType'].astype('category').nbytes

1448

The category data type consumes less space
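
A rough sketch of why the category dtype is smaller: an object column stores one pointer per row, while a category column stores one small integer code per row plus a single copy of each distinct value. The series below is hypothetical, and the exact byte counts depend on the platform.

```python
import pandas as pd

# Hypothetical column with few distinct values repeated many times
fuel = pd.Series(["Diesel", "Petrol", "CNG"] * 1000, dtype="object")

as_object = fuel.nbytes                        # one pointer per row
as_category = fuel.astype("category").nbytes   # integer codes + the 3 categories

print(as_object, as_category)
```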

## Cleaning data

- replace() is used to replace a value with the desired
value

- Syntax: DataFrame.replace([to_replace,
value, ...])

In [92]:
# map the spelled-out door counts to digit strings in one call and assign
# the result back, instead of calling replace(inplace=True) on a selected column
cars_data['Doors'] = cars_data['Doors'].replace(
    {'three': '3', 'four': '4', 'five': '5'})

cars_data['Doors'] = cars_data['Doors'].astype('int64')

In [93]:
cars_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1436 entries, 0 to 1435
Data columns (total 10 columns):
Price        1436 non-null int64
Age          1336 non-null float64
KM           1421 non-null float64
FuelType     1336 non-null object
HP           1430 non-null float64
MetColor     1286 non-null object
Automatic    1436 non-null object
CC           1436 non-null int64
Doors        1436 non-null int64
Weight       1436 non-null int64
dtypes: float64(3), int64(4), object(3)
memory usage: 106.6+ KB
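
The same cleaning step can be sketched on a standalone series. The values and the name `word_to_digit` are illustrative assumptions. The idea is to map every spelled-out number to its digit string first, so that the whole column can then be converted to integers.

```python
import pandas as pd

# Hypothetical column mixing digit strings and spelled-out numbers
doors = pd.Series(["3", "three", "5", "five", "4"], dtype="object")

# Hypothetical lookup table from words to digit strings
word_to_digit = {"three": "3", "four": "4", "five": "5"}

# Replace the words, then convert the now uniform strings to integers
doors_clean = doors.replace(word_to_digit).astype("int64")

print(doors_clean.tolist())  # [3, 3, 5, 5, 4]
```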


## To detect missing values

To check the count of missing values present in each column,
DataFrame.isnull().sum() is used

In [94]:
cars_data.isnull().sum()

Price          0
Age          100
KM            15
FuelType     100
HP             6
MetColor     150
Automatic      0
CC             0
Doors          0
Weight         0
dtype: int64
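
`isnull()` returns a boolean frame, and `sum()` counts the `True` entries column by column. A sketch on a hypothetical frame, which also counts the rows that have at least one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries
toy = pd.DataFrame(
    {
        "Age": [23.0, np.nan, 30.0],
        "KM": [np.nan, np.nan, 38500.0],
        "Price": [13500, 13750, 13950],
    }
)

missing_per_column = toy.isnull().sum()             # True counts as 1
rows_with_missing = toy.isnull().any(axis=1).sum()  # rows with at least one NaN

print(missing_per_column)
print(rows_with_missing)  # 2
```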

In [95]:
cars_data.insert(10, 'Price_Class', '')

In [96]:
cars_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1436 entries, 0 to 1435
Data columns (total 11 columns):
Price          1436 non-null int64
Age            1336 non-null float64
KM             1421 non-null float64
FuelType       1336 non-null object
HP             1430 non-null float64
MetColor       1286 non-null object
Automatic      1436 non-null object
CC             1436 non-null int64
Doors          1436 non-null int64
Weight         1436 non-null int64
Price_Class    1436 non-null object
dtypes: float64(3), int64(4), object(4)
memory usage: 112.2+ KB


In [97]:
for i in cars_data.index:
    # .at reads and sets a single cell by label, which avoids the
    # chained assignment that triggers SettingWithCopyWarning
    price = cars_data.at[i, 'Price']
    if price <= 8450:
        cars_data.at[i, 'Price_Class'] = 'Low'
    elif price > 11950:
        cars_data.at[i, 'Price_Class'] = 'High'
    else:
        cars_data.at[i, 'Price_Class'] = 'Medium'


In [98]:
cars_data['Price_Class']

0         High
1         High
2         High
3         High
4         High
         ...  
1431       Low
1432    Medium
1433    Medium
1434       Low
1435       Low
Name: Price_Class, Length: 1436, dtype: object
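
The same three-way classification can also be written without a Python loop. The sketch below swaps in `numpy.select`, which is not what the notebook uses. It keeps the thresholds 8450 and 11950 from the loop and runs on a hypothetical five-row frame.

```python
import numpy as np
import pandas as pd

# Hypothetical prices chosen to hit both thresholds exactly
toy = pd.DataFrame({"Price": [7500, 8450, 10000, 11950, 13500]})

# Conditions are checked in order; rows matching none get the default
conditions = [toy["Price"] <= 8450, toy["Price"] > 11950]
choices = ["Low", "High"]
toy["Price_Class"] = np.select(conditions, choices, default="Medium")

print(toy)
```

Because the comparison runs on the whole column at once, this version does no per-row assignment at all.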

### Series.value_counts()

Returns a Series containing the count of each unique value

In [99]:
cars_data['Price_Class'].value_counts()

Medium    751
Low       369
High      316
Name: Price_Class, dtype: int64
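
`value_counts()` also accepts `normalize=True`, which returns shares instead of raw counts. A sketch on a hypothetical series of class labels:

```python
import pandas as pd

# Hypothetical class labels
classes = pd.Series(["Low", "Medium", "Medium", "High", "Medium"])

counts = classes.value_counts()                # absolute counts, most frequent first
shares = classes.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```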

In [100]:
cars_data['Age']

0       23.0
1       23.0
2       24.0
3       26.0
4       30.0
        ... 
1431     NaN
1432    72.0
1433     NaN
1434    70.0
1435    76.0
Name: Age, Length: 1436, dtype: float64

In [101]:
def convertAgeToYears(series):
    return series/12

In [102]:
cars_data.insert(11, 'Age_Converted', 0)

In [105]:
cars_data['Age_Converted'] = round(convertAgeToYears(cars_data['Age']), 1)

In [106]:
cars_data['Age_Converted']

0       1.9
1       1.9
2       2.0
3       2.2
4       2.5
       ... 
1431    NaN
1432    6.0
1433    NaN
1434    5.8
1435    6.3
Name: Age_Converted, Length: 1436, dtype: float64
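
Since arithmetic on a Series is applied element by element, the conversion can also be written as a single expression. The sketch below assumes, as the function name `convertAgeToYears` suggests, that `Age` is stored in months. The input values are the first few ages shown above plus one missing value.

```python
import numpy as np
import pandas as pd

# Ages in months (assumed unit), including one missing value
age_months = pd.Series([23.0, 24.0, 26.0, np.nan])

# Divide every element by 12 and round to one decimal; NaN stays NaN
age_years = (age_months / 12).round(1)

print(age_years.tolist())
```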