## Viewing / Exploring Data

In [1]:
import pandas as pd

#### Reading a file / Importing Data

In [2]:
df = pd.read_csv("car.csv")
df

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
1,Honda,Red,87899,4,"$5,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"
3,BMW,Black,11179,5,"$22,000.00"
4,Nissan,White,213095,4,"$3,500.00"
5,Toyota,Green,99213,4,"$4,500.00"
6,Honda,Blue,45698,4,"$7,500.00"
7,Honda,Blue,54738,4,"$7,000.00"
8,Toyota,White,60000,4,"$6,250.00"
9,Nissan,White,31600,4,"$9,700.00"


In [3]:
df.shape # returns a (rows, columns) tuple for the dataframe.

(10, 5)

In [4]:
df.head(3) # shows the first 3 rows; if no number is passed, it shows the first 5 rows of df.

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
1,Honda,Red,87899,4,"$5,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"


In [5]:
df.tail(2) # shows the last 2 rows; if no number is passed, it shows the last 5 rows of df.

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
8,Toyota,White,60000,4,"$6,250.00"
9,Nissan,White,31600,4,"$9,700.00"


In [6]:
df.index # shows the index

RangeIndex(start=0, stop=10, step=1)

In [7]:
df.columns # shows the columns

Index(['Make', 'Colour', 'Odometer (KM)', 'Doors', 'Price'], dtype='object')

In [8]:
df.dtypes # to check the datatypes of each column

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price            object
dtype: object

In [9]:
df.info() # shows a quick summary: number of rows, non-null count and data type of each column, memory usage

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Make           10 non-null     object
 1   Colour         10 non-null     object
 2   Odometer (KM)  10 non-null     int64 
 3   Doors          10 non-null     int64 
 4   Price          10 non-null     object
dtypes: int64(2), object(3)
memory usage: 528.0+ bytes


In [10]:
df.describe() # shows a quick statistical summary of df (by default, only columns with a numeric data type)

Unnamed: 0,Odometer (KM),Doors
count,10.0,10.0
mean,78601.4,4.0
std,61983.471735,0.471405
min,11179.0,3.0
25%,35836.25,4.0
50%,57369.0,4.0
75%,96384.5,4.0
max,213095.0,5.0


__Note:__ Price holds numeric values, but because of the "$" and "," characters it was read in as a string (object) column. We can strip these characters and then cast the column to float or integer, using either a regex or simple string methods.

In [11]:
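# strip the "$" and "," characters, then cast the column to float
# regex=False treats "$" as a literal character rather than a regex anchor
df['Price'] = df['Price'].str.replace('$', '', regex=False)
df['Price'] = df['Price'].str.replace(',', '', regex=False)
df['Price'] = df['Price'].astype(float)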
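
The note above mentions regex as an alternative to plain string replacement. A minimal sketch of that approach follows; the three-value `prices` Series is a stand-in built here for illustration (it copies a few values from the Price column), not part of the original notebook.

```python
import pandas as pd

# Illustrative stand-in for the Price column as it looks right after read_csv
prices = pd.Series(["$4,000.00", "$5,000.00", "$22,000.00"], name="Price")

# Drop every character that is not a digit or a decimal point, then cast to float
cleaned = prices.str.replace(r"[^\d.]", "", regex=True).astype(float)

print(cleaned.tolist())  # [4000.0, 5000.0, 22000.0]
```

A single pattern handles both characters in one pass, at the cost of being slightly less explicit about what is being removed.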

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Make           10 non-null     object 
 1   Colour         10 non-null     object 
 2   Odometer (KM)  10 non-null     int64  
 3   Doors          10 non-null     int64  
 4   Price          10 non-null     float64
dtypes: float64(1), int64(2), object(2)
memory usage: 528.0+ bytes


In [13]:
df.describe() # Price is now included because it is a numeric column

Unnamed: 0,Odometer (KM),Doors,Price
count,10.0,10.0,10.0
mean,78601.4,4.0,7645.0
std,61983.471735,0.471405,5379.407753
min,11179.0,3.0,3500.0
25%,35836.25,4.0,4625.0
50%,57369.0,4.0,6625.0
75%,96384.5,4.0,7375.0
max,213095.0,5.0,22000.0


In [14]:
df[["Doors", "Price"]].mean() # we can also compute statistics for one or more selected columns

Doors       4.0
Price    7645.0
dtype: float64

In [15]:
df["Price"].mean()

7645.0

In [16]:
df.isna() # returns a boolean dataframe: True where a value is null, False otherwise.

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
5,False,False,False,False,False
6,False,False,False,False,False
7,False,False,False,False,False
8,False,False,False,False,False
9,False,False,False,False,False


In [17]:
df.isna().sum() # counts the null values in each column.

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [18]:
df.isnull().sum() # isnull() is an alias of isna(), so the result is the same.

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [19]:
df.isna().any().sum() # counts the columns that contain at least one null value (not the total number of nulls).

0
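
Because the car data has no missing values, every null check above returns zero, which hides the difference between the calls. The sketch below uses a small made-up dataframe (`demo`, invented for illustration) that does contain nulls to show what each expression counts.

```python
import numpy as np
import pandas as pd

# Made-up dataframe with some missing values (not the car data)
demo = pd.DataFrame({
    "Make": ["Toyota", None, "BMW"],
    "Price": [4000.0, np.nan, np.nan],
    "Doors": [4, 4, 5],
})

per_column = demo.isna().sum()            # nulls in each column
cols_with_nulls = demo.isna().any().sum() # number of columns with at least one null
total_nulls = demo.isna().sum().sum()     # total nulls in the whole dataframe

print(per_column.to_dict())  # {'Make': 1, 'Price': 2, 'Doors': 0}
print(cols_with_nulls)       # 2
print(total_nulls)           # 3
```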

## Filtering Data

- We can select columns/rows using indexing methods, which are covered in the [Introduction to Pandas Notebook](https://github.com/afaqueumer/Pandas-Playground/blob/main/Introduction%20to%20Pandas.ipynb)
    
- We can use the __`logical operators`__, __`isin`__, __`str functions`__ etc. on column values to filter rows.

In [20]:
# this returns a boolean Series; when it is used to index df, only the rows where the value is True are kept
df.Price > 5000 

0    False
1    False
2     True
3     True
4    False
5    False
6     True
7     True
8     True
9     True
Name: Price, dtype: bool

In [21]:
df[df.Price > 5000]

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
2,Toyota,Blue,32549,3,7000.0
3,BMW,Black,11179,5,22000.0
6,Honda,Blue,45698,4,7500.0
7,Honda,Blue,54738,4,7000.0
8,Toyota,White,60000,4,6250.0
9,Nissan,White,31600,4,9700.0


In [22]:
# multiple conditions

df[(df.Price > 3000) & (df.Colour == 'Blue') & (df.Doors == 4) ]

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
6,Honda,Blue,45698,4,7500.0
7,Honda,Blue,54738,4,7000.0
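
The cell above combines conditions with `&` (and). As a sketch of the other logical operators, the example below uses `|` (or) and `~` (not). The `cars` dataframe is rebuilt here from the table shown earlier so the block runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Rebuilt from the car table above so the example is self-contained
cars = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW", "Nissan",
             "Toyota", "Honda", "Honda", "Toyota", "Nissan"],
    "Colour": ["White", "Red", "Blue", "Black", "White",
               "Green", "Blue", "Blue", "White", "White"],
    "Doors": [4, 4, 3, 5, 4, 4, 4, 4, 4, 4],
    "Price": [4000.0, 5000.0, 7000.0, 22000.0, 3500.0,
              4500.0, 7500.0, 7000.0, 6250.0, 9700.0],
})

# OR: rows that are Blue, or cheaper than 4500 (each condition needs its own parentheses)
blue_or_cheap = cars[(cars["Colour"] == "Blue") | (cars["Price"] < 4500)]

# NOT: every row whose Make is not Toyota
not_toyota = cars[~(cars["Make"] == "Toyota")]

print(blue_or_cheap.index.tolist())  # [0, 2, 4, 6, 7]
print(not_toyota.index.tolist())     # [1, 3, 4, 6, 7, 9]
```

The parentheses matter because `&`, `|` and `~` bind more tightly than comparison operators in Python; Python's `and`, `or` and `not` keywords do not work element-wise on a Series.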


  The __`isin`__ method is another way of filtering on multiple conditions: it keeps the rows whose column value matches any item in a given list.

In [23]:
prices = [7000, 7500]
df[df.Price.isin(prices)]

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
2,Toyota,Blue,32549,3,7000.0
6,Honda,Blue,45698,4,7500.0
7,Honda,Blue,54738,4,7000.0
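
To make the behaviour of `isin` concrete, the sketch below checks that it selects the same rows as a chain of `==` comparisons joined with `|`, and shows that it works on string columns too. The `cars` dataframe is rebuilt from the table above for illustration.

```python
import pandas as pd

# Rebuilt from the car table above (only the columns needed here)
cars = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW", "Nissan",
             "Toyota", "Honda", "Honda", "Toyota", "Nissan"],
    "Price": [4000.0, 5000.0, 7000.0, 22000.0, 3500.0,
              4500.0, 7500.0, 7000.0, 6250.0, 9700.0],
})

prices = [7000, 7500]
via_isin = cars[cars["Price"].isin(prices)]
via_or = cars[(cars["Price"] == 7000) | (cars["Price"] == 7500)]

print(via_isin.equals(via_or))  # True

# isin also accepts a list of strings
bmw_or_nissan = cars[cars["Make"].isin(["BMW", "Nissan"])]
print(bmw_or_nissan.index.tolist())  # [3, 4, 9]
```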


  The functions and methods under the __`str accessor`__ provide flexible ways to filter rows based on strings.

In [24]:
df[df.Make.str.startswith('H')]

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
1,Honda,Red,87899,4,5000.0
6,Honda,Blue,45698,4,7500.0
7,Honda,Blue,54738,4,7000.0
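
`startswith` is only one of the methods available under the `str` accessor. As a short sketch, the example below filters with `str.contains`, `str.endswith` and `str.len`; the `cars` dataframe holds just the Make column, rebuilt from the table above for illustration.

```python
import pandas as pd

# Make column rebuilt from the car table above
cars = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW", "Nissan",
             "Toyota", "Honda", "Honda", "Toyota", "Nissan"],
})

# Rows whose Make contains the literal substring "ss"
contains_ss = cars[cars["Make"].str.contains("ss", regex=False)]

# Rows whose Make ends with "a"
ends_with_a = cars[cars["Make"].str.endswith("a")]

# Rows whose Make is exactly three characters long
three_letters = cars[cars["Make"].str.len() == 3]

print(contains_ss.index.tolist())    # [4, 9]
print(ends_with_a.index.tolist())    # [0, 1, 2, 5, 6, 7, 8]
print(three_letters.index.tolist())  # [3]
```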


__`nlargest`__ or __`nsmallest`__ 

  We specify the number of largest or smallest values to be selected and the name of the column.

In [25]:
df.nlargest(3, 'Price')

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
3,BMW,Black,11179,5,22000.0
9,Nissan,White,31600,4,9700.0
6,Honda,Blue,45698,4,7500.0


In [26]:
df.nsmallest(3, 'Price')

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
4,Nissan,White,213095,4,3500.0
0,Toyota,White,150043,4,4000.0
5,Toyota,Green,99213,4,4500.0
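
One way to read `nlargest` is as a shortcut for sorting and taking the top rows. The sketch below checks that equivalence on the car prices and shows the `keep` parameter, which controls how ties are handled. The `cars` dataframe is rebuilt from the table above for illustration.

```python
import pandas as pd

# Rebuilt from the car table above (only the columns needed here)
cars = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW", "Nissan",
             "Toyota", "Honda", "Honda", "Toyota", "Nissan"],
    "Price": [4000.0, 5000.0, 7000.0, 22000.0, 3500.0,
              4500.0, 7500.0, 7000.0, 6250.0, 9700.0],
})

top3 = cars.nlargest(3, "Price")
top3_sorted = cars.sort_values("Price", ascending=False).head(3)

print(top3.index.tolist())       # [3, 9, 6]
print(top3.equals(top3_sorted))  # True

# Two cars share the fourth-highest price (7000); keep="all" returns both of them
top4_with_ties = cars.nlargest(4, "Price", keep="all")
print(len(top4_with_ties))  # 5
```

The two approaches agree here because the three highest prices are distinct; when values tie at the cut-off, the rows returned depend on `keep` (the default is `"first"`).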
