The data does not always come out of memory in the format we want. This notebook covers the operations we need to do just to make our data RIGHT!

In [3]:
import pandas as pd

In [8]:
data = pd.read_csv('new.csv')
data.head(3)

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
0,Maruti,,50000,Gurgaon,28
1,Hyundai,2014.0,30000,Delhi,27
2,Tata,,60000,,25


# Summary Functions:

In [9]:
data.Mileage.describe()

count     9.000000
mean     25.777778
std       2.538591
min      21.000000
25%      24.000000
50%      26.000000
75%      28.000000
max      29.000000
Name: Mileage, dtype: float64

In [10]:
data.Brand.describe() #simple summary statistic

count          9
unique         5
top       Maruti
freq           3
Name: Brand, dtype: object

In [11]:
data.City.unique() #To see a list of unique values

array(['Gurgaon', 'Delhi', nan, 'Ghaziabad'], dtype=object)

In [12]:
data.City.value_counts() #how often they occur in the dataset

Delhi        3
Gurgaon      1
Ghaziabad    1
Name: City, dtype: int64

In [13]:
data.describe() #Summary Statistics

Unnamed: 0,Year,Kms Driven,Mileage
count,6.0,9.0,9.0
mean,2016.0,31000.0,25.777778
std,2.097618,17755.280905,2.538591
min,2014.0,10000.0,21.0
25%,2014.25,15000.0,24.0
50%,2015.5,30000.0,26.0
75%,2017.5,46000.0,28.0
max,2019.0,60000.0,29.0


# MAPS

In data science we often need to create new representations from existing data, or to transform data from the format it is in now to the format we want it in later. Maps handle this work, which makes them extremely important for getting your work done!

There are two mapping functions that are used frequently:

    1. map()
    2. apply()

The function you pass to map() should expect a single value from the Series (a Mileage value, in the example below) and return a transformed version of that value. map() returns a new Series in which every value has been transformed by your function.

In [17]:
mileage_mean = data.Mileage.mean()
mileage_mean

25.77777777777778

In [19]:
data.Mileage.map(lambda p: p - mileage_mean)
#It transforms each Mileage value into (Mileage - mean Mileage)

0    2.222222
1    1.222222
2   -0.777778
3    0.222222
4    2.222222
5    3.222222
6   -1.777778
7   -4.777778
8   -1.777778
Name: Mileage, dtype: float64

1. apply() transforms the whole DataFrame by calling a custom function on each row (or on each column).
2. map() transforms a single Series, that is, one column, value by value.

In [21]:
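def remean_Mileage(row):
    # row is one record of the DataFrame; reuse the mean computed earlier
    # instead of recomputing it for every row
    row.Mileage = row.Mileage - mileage_mean
    return row

data.apply(remean_Mileage, axis='columns')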

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
0,Maruti,,50000,Gurgaon,2.222222
1,Hyundai,2014.0,30000,Delhi,1.222222
2,Tata,,60000,,-0.777778
3,Mahindra,2015.0,25000,Delhi,0.222222
4,Maruti,,10000,,2.222222
5,Hyundai,2016.0,46000,Delhi,3.222222
6,Renault,2014.0,31000,,-1.777778
7,Tata,2018.0,15000,,-4.777778
8,Maruti,2019.0,12000,Ghaziabad,-1.777778


If we had called data.apply() with axis='index', then instead of passing a function to transform each row, we would need to pass a function to transform each column.
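
As a rough illustration of the axis='index' case, the sketch below applies a function to each column instead of each row. The small DataFrame is a stand-in built from the first three rows shown earlier, and `center_column` is a hypothetical helper name, not something from the notebook.

```python
import pandas as pd

# Stand-in for the notebook's data: first three rows, numeric columns only.
data = pd.DataFrame({
    "Kms Driven": [50000, 30000, 60000],
    "Mileage": [28, 27, 25],
})

def center_column(col):
    # With axis="index", col is a whole column (a Series), not a single row.
    return col - col.mean()

# Each column is passed to center_column in turn.
centered = data.apply(center_column, axis="index")
print(centered)
```

Only numeric columns are used here because subtracting a mean from a text column such as Brand would fail.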

Note that map() and apply() return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on. 
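
A minimal sketch of that point, using a stand-in DataFrame built from the first three rows shown earlier. The column name `Mileage_centered` is a hypothetical name chosen for illustration.

```python
import pandas as pd

# Stand-in for the notebook's data (first three rows, two columns).
data = pd.DataFrame({
    "Brand": ["Maruti", "Hyundai", "Tata"],
    "Mileage": [28, 27, 25],
})

mileage_mean = data.Mileage.mean()
centered = data.Mileage.map(lambda m: m - mileage_mean)

# The original column is untouched; only the returned Series has the new values.
print(data.Mileage.tolist())  # [28, 27, 25]

# To keep the result, assign it back explicitly (hypothetical column name).
data["Mileage_centered"] = centered
```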

An easier way to re-center Mileage: pandas operators work directly on a whole Series, so the same result needs no map() or apply():

In [22]:
data.Mileage - data.Mileage.mean()

0    2.222222
1    1.222222
2   -0.777778
3    0.222222
4    2.222222
5    3.222222
6   -1.777778
7   -4.777778
8   -1.777778
Name: Mileage, dtype: float64

In [23]:
data.Brand + '-' + data.City #Concatenate Brand and City; rows with a missing City give NaN

0      Maruti-Gurgaon
1       Hyundai-Delhi
2                 NaN
3      Mahindra-Delhi
4                 NaN
5       Hyundai-Delhi
6                 NaN
7                 NaN
8    Maruti-Ghaziabad
dtype: object
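
The NaN rows appear because concatenating a string with a missing value yields a missing value. One possible way to keep those rows readable, sketched here on a stand-in DataFrame, is to fill the gaps first with fillna(). The placeholder text "Unknown" is an assumption made for this example, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Stand-in for the notebook's data: first three rows, Brand and City only.
data = pd.DataFrame({
    "Brand": ["Maruti", "Hyundai", "Tata"],
    "City": ["Gurgaon", "Delhi", np.nan],
})

# Replace missing cities with a placeholder before concatenating.
label = data.Brand + "-" + data.City.fillna("Unknown")
print(label.tolist())  # ['Maruti-Gurgaon', 'Hyundai-Delhi', 'Tata-Unknown']
```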

In [25]:
data.Mileage.idxmax() #Returns the index label of the maximum value in the column

5

In [27]:
data.Mileage.idxmin() #Returns the index label of the minimum value in the column

7

# Question: Which car has the best mileage?

In [29]:
data_id = data.Mileage.idxmax()
data_id

In [31]:
car = data.loc[data_id,'Brand']
car

'Hyundai'

In [37]:
data

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
0,Maruti,,50000,Gurgaon,28
1,Hyundai,2014.0,30000,Delhi,27
2,Tata,,60000,,25
3,Mahindra,2015.0,25000,Delhi,26
4,Maruti,,10000,,28
5,Hyundai,2016.0,46000,Delhi,29
6,Renault,2014.0,31000,,24
7,Tata,2018.0,15000,,21
8,Maruti,2019.0,12000,Ghaziabad,24


# Question: Create a Series that counts how many times the words Maruti and Tata appear in the Brand column

In [40]:
Maruti_count = data.Brand.map(lambda x: 'Maruti' in x).sum()
Tata_count = data.Brand.map(lambda x: 'Tata' in x).sum()
total_count = pd.Series([Maruti_count, Tata_count], index=['Maruti', 'Tata'])
total_count

Maruti    3
Tata      2
dtype: int64
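
The same counts can probably be obtained more directly with value_counts(), shown earlier in this notebook. This is a sketch on a stand-in Series holding the nine Brand values from the table above. Note that value_counts() matches whole values, while `'Maruti' in x` is a substring test; the two agree here only because every Brand cell is a single brand name.

```python
import pandas as pd

# Stand-in for data.Brand, copied from the full table shown above.
brands = pd.Series(
    ["Maruti", "Hyundai", "Tata", "Mahindra", "Maruti",
     "Hyundai", "Renault", "Tata", "Maruti"],
    name="Brand",
)

# Count every brand once, then keep only the two we care about.
total_count = brands.value_counts().loc[["Maruti", "Tata"]]
print(total_count)
```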