# Basic Metrics

When we summarize data, which metrics should we look at?

In this notebook, we will look at the car dataset.

To learn how the data was acquired, see [this repo](https://github.com/amitkaps/cars).


In [1]:
#Import the required libraries
import numpy as np
import pandas as pd
from datetime import datetime as dt
from scipy import stats

### Read the dataset


In [3]:
cars = pd.read_csv("cars_v1.csv", encoding = "ISO-8859-1")

### Warm up

In [5]:
#Display the first 5 records


Unnamed: 0,Make,Model,Price,Type,ABS,BootSpace,GearType,AirBag,Engine,FuelCapacity,Mileage
0,Ashok Leyland Stile,Ashok Leyland Stile LE 8-STR (Diesel),750,MPV,No,500.0,Manual,No,1461.0,50.0,20.7
1,Ashok Leyland Stile,Ashok Leyland Stile LS 8-STR (Diesel),800,MPV,No,500.0,Manual,No,1461.0,50.0,20.7
2,Ashok Leyland Stile,Ashok Leyland Stile LX 8-STR (Diesel),830,MPV,No,500.0,Manual,No,1461.0,50.0,20.7
3,Ashok Leyland Stile,Ashok Leyland Stile LS 7-STR (Diesel),850,MPV,No,500.0,Manual,No,1461.0,50.0,20.7
4,Ashok Leyland Stile,Ashok Leyland Stile LS 7-STR Alloy (Diesel),880,MPV,No,500.0,Manual,No,1461.0,50.0,20.7


**Exercise**

In [None]:
#Display the first 10 records

In [7]:
#Display the last 5 records

In [8]:
#Find the number of rows and columns in the dataset

In [9]:
#What are the column names in the dataset?

In [10]:
#What are the types of those columns ? 
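
One possible way to approach the warm-up exercises is sketched below. It runs on a small made-up DataFrame named `toy` (an assumption, so that the snippet works without `cars_v1.csv`); the same calls should apply to `cars`.

```python
import pandas as pd

# Hypothetical stand-in for the cars dataset; the values are made up.
toy = pd.DataFrame({
    "Make": ["A", "A", "B", "B", "C", "C"],
    "Price": [750, 800, 830, 850, 880, 900],
    "Mileage": [20.7, 20.7, 18.0, 17.5, 15.2, 14.8],
})

first_three = toy.head(3)          # first n rows (n defaults to 5)
last_two = toy.tail(2)             # last n rows (n defaults to 5)
n_rows, n_cols = toy.shape         # (number of rows, number of columns)
column_names = list(toy.columns)   # column labels
column_types = toy.dtypes          # one dtype per column

print(first_three)
print(last_two)
print(n_rows, n_cols)
print(column_names)
print(column_types)
```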

In [11]:
cars.head()

Unnamed: 0,Make,Model,Price,Type,ABS,BootSpace,GearType,AirBag,Engine,FuelCapacity,Mileage
0,Ashok Leyland Stile,Ashok Leyland Stile LE 8-STR (Diesel),750,MPV,No,500.0,Manual,No,1461.0,50.0,20.7
1,Ashok Leyland Stile,Ashok Leyland Stile LS 8-STR (Diesel),800,MPV,No,500.0,Manual,No,1461.0,50.0,20.7
2,Ashok Leyland Stile,Ashok Leyland Stile LX 8-STR (Diesel),830,MPV,No,500.0,Manual,No,1461.0,50.0,20.7
3,Ashok Leyland Stile,Ashok Leyland Stile LS 7-STR (Diesel),850,MPV,No,500.0,Manual,No,1461.0,50.0,20.7
4,Ashok Leyland Stile,Ashok Leyland Stile LS 7-STR Alloy (Diesel),880,MPV,No,500.0,Manual,No,1461.0,50.0,20.7


In [18]:
#How to check if there are null values in any of the columns?

#Hint: use the isnull() function  (how about using sum or values/any with it?)

Make              0
Model             0
Price             0
Type              0
ABS              16
BootSpace       179
GearType         16
AirBag           21
Engine            7
FuelCapacity      0
Mileage         171
dtype: int64
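
A minimal sketch of the hint above, using a made-up DataFrame `toy` (an assumption, not the real data): `isnull()` marks each missing cell as `True`, `sum()` counts them per column, and `values.any()` answers whether anything is missing at all.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing values.
toy = pd.DataFrame({
    "Make": ["A", "B", "C", "D"],
    "BootSpace": [500.0, np.nan, 300.0, np.nan],
    "Mileage": [20.7, 18.0, np.nan, 14.8],
})

nulls_per_column = toy.isnull().sum()   # count of missing values in each column
any_nulls = toy.isnull().values.any()   # True if at least one value is missing

print(nulls_per_column)
print(any_nulls)
```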

**How to handle missing values?**

In [22]:
#fillna function
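
Two common options are sketched here on a made-up Series `mileage` (an assumption): replace missing values with `fillna`, or drop them with `dropna`. Filling with the mean is only one choice; which strategy is appropriate depends on the column and on the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical mileage values with gaps.
mileage = pd.Series([20.7, np.nan, 15.3, np.nan])

filled_with_mean = mileage.fillna(mileage.mean())  # mean() skips NaN by default
filled_with_zero = mileage.fillna(0)               # fill with a constant instead
without_missing = mileage.dropna()                 # or drop the missing rows

print(filled_with_mean)
print(filled_with_zero)
print(without_missing)
```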

# Mean, Median, Variance, Standard Deviation

#### Mean

The arithmetic average of a set of values, computed by dividing the sum of all values by the number of values.

In [25]:
#Find mean of price


3159.4957983193276

In [26]:
#Find mean of Mileage


17.480407854984882
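
A sketch of the mean on made-up numbers (the Series below are assumptions, not the real columns). Note that pandas skips missing values by default, which matters for a column such as `Mileage` that contains nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration.
price = pd.Series([750, 800, 830, 850, 880])
mileage = pd.Series([20.0, 18.0, np.nan, 16.0])

mean_price = price.mean()                    # sum of values / number of values
manual_mean_price = price.sum() / len(price)
mean_mileage = mileage.mean()                # NaN is ignored (skipna=True)

print(mean_price, manual_mean_price, mean_mileage)
```

On the real data, the equivalent calls would be `cars.Price.mean()` and `cars.Mileage.mean()`.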

Let's do something fancier.
Let's find mean mileage of every make. 

*Hint*: need to use `groupby`

In [39]:
#cars.groupby('Make') : Finish the code


Unnamed: 0,Make,Mileage
0,Ashok Leyland Stile,20.700000
1,Aston Martin Rapide,7.000000
2,Aston Martin Rapide S,11.900000
3,Aston Martin V12 Vantage,9.000000
4,Aston Martin V8 Vantage,5.000000
5,Aston Martin Vanquish,8.000000
6,Audi A3 Cabriolet,19.186667
7,Audi A4,14.804000
8,Audi A6,15.260000
9,Audi A7,14.400000
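
One way to finish the `groupby` hint, shown on a made-up DataFrame `toy` (an assumption): group by `Make`, select `Mileage`, take the mean, and call `reset_index()` to get a flat table like the output above.

```python
import pandas as pd

# Hypothetical rows; the mileage values are made up.
toy = pd.DataFrame({
    "Make": ["Audi A4", "Audi A4", "Audi A6", "Audi A6", "Audi A7"],
    "Mileage": [14.0, 16.0, 15.0, 15.5, 14.4],
})

# Mean mileage per make, returned as a DataFrame with Make as a regular column.
mean_by_make = toy.groupby("Make")["Mileage"].mean().reset_index()

print(mean_by_make)
```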


### Exercise

**How about finding the average mileage for every `Type-GearType` combination?**

Unnamed: 0,Type,GearType,Mileage
0,Convertible,Automatic,12.147143
1,Coupe,Automatic,9.746071
2,Hatchback,Automatic,19.446111
3,Hatchback,Manual,20.804878
4,Hatchback,No,21.748
5,MPV,Automatic,20.434286
6,MPV,Manual,18.825862
7,MUV,Automatic,12.466667
8,MUV,Manual,15.634091
9,MUV,No,13.912
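
A possible solution sketch on made-up data (the DataFrame `toy` is an assumption): passing a list of column names to `groupby` produces one group per combination.

```python
import pandas as pd

# Hypothetical rows; the mileage values are made up.
toy = pd.DataFrame({
    "Type": ["Hatchback", "Hatchback", "Hatchback", "MPV", "MPV"],
    "GearType": ["Manual", "Manual", "Automatic", "Manual", "Automatic"],
    "Mileage": [20.0, 22.0, 19.0, 18.0, 20.0],
})

# Mean mileage for every (Type, GearType) pair, sorted by the group keys.
mean_by_combo = toy.groupby(["Type", "GearType"])["Mileage"].mean().reset_index()

print(mean_by_combo)
```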


#### Median

The value lying at the midpoint of the sorted observations, such that an observation is equally likely to fall above or below it. Simply put, it is the *middle* value in the sorted list of numbers.

If the count n is odd, the median is the value at position (n+1)/2,

else it is the average of the values at positions n/2 and n/2 + 1

**Find median of mileage**

17.985
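
A small sketch of the odd and even cases on made-up numbers (both Series are assumptions). On the real data, the call would be `cars.Mileage.median()`.

```python
import pandas as pd

odd_count = pd.Series([3, 1, 2])       # sorted: 1, 2, 3
even_count = pd.Series([4, 1, 3, 2])   # sorted: 1, 2, 3, 4

median_odd = odd_count.median()    # the middle value of the sorted data
median_even = even_count.median()  # the average of the two middle values

print(median_odd, median_even)
```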

#### Mode

It is the value which appears most often in a set of values.

**Find the mode of `Type` of cars**

In [55]:
#Let's first find count of each of the car Types
#Hint: use value_counts

Sedan          294
Hatchback      222
SUV            186
MPV             47
MUV             40
Coupe           33
Convertible     11
Name: Type, dtype: int64

In [56]:
#Mode of cars

0    Sedan
dtype: object
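
Both steps can be sketched on a made-up Series `types` (an assumption). `value_counts()` sorts by frequency, and `mode()` returns a Series because several values can tie for the most frequent.

```python
import pandas as pd

# Hypothetical car types.
types = pd.Series(["Sedan", "SUV", "Sedan", "Hatchback", "Sedan", "SUV"])

counts = types.value_counts()   # frequency of each value, most frequent first
most_common = types.mode()      # Series holding the most frequent value(s)

print(counts)
print(most_common)
```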

#### Variance

> Once, two statisticians of height 4 feet and 5 feet had to cross a river of AVERAGE depth 3 feet. Meanwhile, a third person came along and said, "What are you waiting for? You can easily cross the river."

It's the average of the squared distances of the data values from the *mean*

<img style="float: left;" src="img/variance.png" height="320" width="320">

**Find variance of mileage**

21.018811179847457
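
A sketch of the calculation on made-up numbers (the Series `values` is an assumption). One detail worth knowing: pandas' `var()` divides by n - 1 by default (the sample variance); pass `ddof=0` to divide by n instead.

```python
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])   # made-up numbers, mean is 5

deviations = values - values.mean()
population_var = (deviations ** 2).mean()                  # divide by n
sample_var = (deviations ** 2).sum() / (len(values) - 1)   # divide by n - 1

print(population_var, values.var(ddof=0))   # these two agree
print(sample_var, values.var())             # these two agree
```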

#### Standard Deviation

It is the square root of variance. This will have the same units as the data and mean. 

**Find standard deviation of mileage**

4.584627703516116
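
A brief check, on made-up numbers (the Series `values` is an assumption), that the standard deviation is the square root of the variance. As with `var()`, pandas' `std()` uses n - 1 in the denominator unless `ddof=0` is passed.

```python
import numpy as np
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])   # made-up numbers

sample_std = values.std()              # square root of the sample variance
population_std = values.std(ddof=0)    # square root of the population variance

print(sample_std, np.sqrt(values.var()))
print(population_std)
```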

#### Using Pandas built-in function

In [None]:
cars.describe()

#### Co-variance 

Covariance is a measure of the (average) co-variation between two variables, say x and y: how much the two variables change together. Its sign describes the direction of their relationship, and its magnitude depends on how far the variables are spread out. Compare this to variance, which measures how much a single variable varies around its own mean.

<img style="float: left;" src="img/covariance.png" height="270" width="270">

<br>
<br>
<br>
<br>



#### Co-variance of mileage of Automatic and Manual Gear Type

In [67]:
pd.unique(cars.GearType)

array([' Manual', ' Automatic', nan, ' No'], dtype=object)

In [82]:
cars_Automatic = cars[cars.GearType==' Automatic'].copy().reset_index()

In [83]:
cars_Manual = cars[cars.GearType==' Manual'].copy().reset_index()

In [84]:
cars_Automatic.head()

Unnamed: 0,index,Make,Model,Price,Type,ABS,BootSpace,GearType,AirBag,Engine,FuelCapacity,Mileage
0,7,Aston Martin Rapide,Aston Martin Rapide LUXE (Petrol),35000,Sedan,Yes,300.0,Automatic,Yes,5935.0,90.5,7.0
1,8,Aston Martin Rapide S,Aston Martin Rapide S (Petrol),44000,Sedan,Yes,,Automatic,Yes,5935.0,90.0,11.9
2,9,Aston Martin V12 Vantage,Aston Martin V12 Vantage Coupe (Petrol),35000,Coupe,Yes,300.0,Automatic,Yes,5935.0,80.0,9.0
3,10,Aston Martin V8 Vantage,Aston Martin V8 Vantage Coupe (Petrol),13500,Coupe,Yes,300.0,Automatic,Yes,4735.0,80.0,5.0
4,11,Aston Martin V8 Vantage,Aston Martin V8 Vantage S Coupe (Petrol),25500,Coupe,Yes,300.0,Automatic,Yes,4735.0,80.0,5.0


In [85]:
cars_Manual.head()

Unnamed: 0,index,Make,Model,Price,Type,ABS,BootSpace,GearType,AirBag,Engine,FuelCapacity,Mileage
0,0,Ashok Leyland Stile,Ashok Leyland Stile LE 8-STR (Diesel),750,MPV,No,500.0,Manual,No,1461.0,50.0,20.7
1,1,Ashok Leyland Stile,Ashok Leyland Stile LS 8-STR (Diesel),800,MPV,No,500.0,Manual,No,1461.0,50.0,20.7
2,2,Ashok Leyland Stile,Ashok Leyland Stile LX 8-STR (Diesel),830,MPV,No,500.0,Manual,No,1461.0,50.0,20.7
3,3,Ashok Leyland Stile,Ashok Leyland Stile LS 7-STR (Diesel),850,MPV,No,500.0,Manual,No,1461.0,50.0,20.7
4,4,Ashok Leyland Stile,Ashok Leyland Stile LS 7-STR Alloy (Diesel),880,MPV,No,500.0,Manual,No,1461.0,50.0,20.7


In [86]:
cars_Manual.shape

(421, 12)

In [87]:
cars_Automatic.shape

(372, 12)

The number of observations has to be the same. For the current exercise, let's take the first 300 observations in both the datasets.

In [91]:
# .ix was removed in pandas 1.0; .loc slices by label and includes the end label
cars_Automatic = cars_Automatic.loc[:299, :]
cars_Manual = cars_Manual.loc[:299, :]

In [92]:
cars_Automatic.shape

(300, 12)

In [93]:
cars_Manual.shape

(300, 12)

In [98]:
cars_manual_automatic = pd.DataFrame([cars_Automatic.Mileage, cars_Manual.Mileage])

In [97]:
cars_manual_automatic

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
Mileage,7.0,11.9,9.0,5.0,5.0,5.0,8.0,16.6,17.0,16.55,...,13.1,13.1,17.2,13.7,17.2,14.8,,,,19.3
Mileage,20.7,20.7,20.7,20.7,20.7,20.7,20.7,17.32,13.7,13.7,...,20.5,20.45,20.5,19.0,19.01,19.01,19.01,13.05,19.87,19.87


In [99]:
cars_manual_automatic = cars_manual_automatic.T

In [101]:
cars_manual_automatic.head()

Unnamed: 0,Mileage,Mileage.1
0,7.0,20.7
1,11.9,20.7
2,9.0,20.7
3,5.0,20.7
4,5.0,20.7


In [103]:
cars_manual_automatic.columns = ['Mileage_Automatic', 'Mileage_Manual']

In [104]:
cars_manual_automatic.head()

Unnamed: 0,Mileage_Automatic,Mileage_Manual
0,7.0,20.7
1,11.9,20.7
2,9.0,20.7
3,5.0,20.7
4,5.0,20.7


In [105]:
#Co-variance matrix between the mileages of automatic and manual:


Unnamed: 0,Mileage_Automatic,Mileage_Manual
Mileage_Automatic,22.375515,0.446292
Mileage_Manual,0.446292,12.776373
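
A sketch of how such a matrix could be produced with `DataFrame.cov()`. The DataFrame `pairs` and its numbers are made up (an assumption); only the column names follow the notebook. The diagonal holds each column's variance and the off-diagonal entries hold the covariance. pandas uses n - 1 in the denominator and leaves out missing values.

```python
import pandas as pd

# Hypothetical paired values, not the real mileages.
pairs = pd.DataFrame({
    "Mileage_Automatic": [1.0, 2.0, 3.0, 4.0],
    "Mileage_Manual": [2.0, 1.0, 4.0, 3.0],
})

cov_matrix = pairs.cov()

# The same off-diagonal value computed by hand.
dx = pairs["Mileage_Automatic"] - pairs["Mileage_Automatic"].mean()
dy = pairs["Mileage_Manual"] - pairs["Mileage_Manual"].mean()
manual_cov = (dx * dy).sum() / (len(pairs) - 1)

print(cov_matrix)
print(manual_cov)
```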


### Correlation

Extent to which two or more variables fluctuate together. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases.

<img style="float: left;" src="img/correlation.gif" height="270" width="270">

<br>
<br>
<br>



In [106]:
#### Find the correlation between the mileages of automatic and manual in the above dataset

Unnamed: 0,Mileage_Automatic,Mileage_Manual
Mileage_Automatic,1.0,0.026011
Mileage_Manual,0.026011,1.0
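
A sketch using `DataFrame.corr()`, which computes the Pearson correlation by default. The DataFrame `pairs` and its numbers are made up (an assumption). The correlation is the covariance divided by the product of the two standard deviations, so it always lies between -1 and 1.

```python
import pandas as pd

# Hypothetical paired values, not the real mileages.
pairs = pd.DataFrame({
    "Mileage_Automatic": [1.0, 2.0, 3.0, 4.0],
    "Mileage_Manual": [2.0, 1.0, 4.0, 3.0],
})

corr_matrix = pairs.corr()

# The same value from its definition: cov(x, y) / (std(x) * std(y)).
cov_xy = pairs.cov().loc["Mileage_Automatic", "Mileage_Manual"]
manual_corr = cov_xy / (pairs["Mileage_Automatic"].std() * pairs["Mileage_Manual"].std())

print(corr_matrix)
print(manual_corr)
```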


# Correlation != Causation

Correlation between two variables does not necessarily imply that one causes the other.


<img style="float: left;" src="img/correlation_not_causation.gif" height="570" width="570">