# Basic Metrics

When we think about summarizing data, what are the metrics that we look at?

In this notebook, we will look at the cars dataset.

To learn how the data was acquired, see [this repo](https://github.com/amitkaps/cars).


In [None]:
#Import the required libraries
import numpy as np
import pandas as pd
from datetime import datetime as dt
from scipy import stats

### Read the dataset


In [None]:
cars = pd.read_csv("cars_v1.csv", encoding = "ISO-8859-1")

### Warm up

In [None]:
cars.head()


**Exercise**

In [None]:
#Display the first 10 records
cars.head(10)

In [None]:
#Display the last 5 records
cars.tail()

In [None]:
#Find the number of rows and columns in the dataset
cars.shape

In [None]:
#What are the column names in the dataset?
cars.columns

In [None]:
#What are the types of those columns?
cars.dtypes

In [None]:
cars.head()

In [None]:
#How to check if there are null values in any of the columns?

#Hint: use the isnull() function  (how about using sum or values/any with it?)
cars.isnull().sum()

**How to handle missing values?**

In [None]:
#fillna function
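
A minimal sketch of `fillna` on a small made-up DataFrame. The `toy` frame and its values are invented for illustration and are not taken from `cars_v1.csv`. It fills a numeric column with its median and a text column with a placeholder; whether either choice suits the cars data is an assumption you would need to check.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Mileage": [18.0, np.nan, 22.0, 20.0],
    "GearType": ["Manual", None, "Automatic", "Manual"],
})

# Count missing values per column before filling
missing_before = toy.isnull().sum()

# Fill numeric gaps with the column median, text gaps with a placeholder
filled = toy.copy()
filled["Mileage"] = filled["Mileage"].fillna(filled["Mileage"].median())
filled["GearType"] = filled["GearType"].fillna("Unknown")

print(missing_before)
print(filled)
```

Dropping incomplete rows with `dropna()` is the other common option; which one is appropriate depends on how much data is missing and why.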


# Mean, Median, Variance, Standard Deviation

#### Mean

The arithmetic average of a set of values, computed by dividing the sum of all values by the number of values.

In [None]:
#Find mean of price
cars.Price.mean()

In [None]:
#Find mean of Mileage
cars.Mileage.mean()

Let's do something fancier.
Let's find mean mileage of every make. 

*Hint*: need to use `groupby`

In [None]:
#cars.groupby('Make') : Finish the code
cars.groupby('Make').Mileage.mean().reset_index()

### Exercise

**How about finding the average mileage for every `Type-GearType` combination?**
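
One possible approach, sketched on made-up data: pass a list of column names to `groupby`. The `toy` frame below is invented for illustration; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Toy data, invented for illustration; column names mirror the notebook's
toy = pd.DataFrame({
    "Type": ["Hatchback", "Hatchback", "Sedan", "Sedan", "Sedan"],
    "GearType": ["Manual", "Manual", "Manual", "Automatic", "Automatic"],
    "Mileage": [20.0, 22.0, 18.0, 15.0, 17.0],
})

# A list of columns groups by every combination that occurs in the data
mean_by_combo = toy.groupby(["Type", "GearType"]).Mileage.mean().reset_index()
print(mean_by_combo)
```

On the notebook's data the equivalent call would be `cars.groupby(['Type', 'GearType']).Mileage.mean()`.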

#### Median

The value lying at the midpoint of the sorted observations, such that an observation is equally likely to fall above or below it. Simply put, it is the *middle* value in the sorted list of numbers.

If the count n is odd, the median is the value at position (n+1)/2,

else it is the average of the values at positions n/2 and n/2 + 1.
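
The positional rule can be checked by hand. The sketch below uses a hypothetical helper, `median_by_position`, and made-up numbers, and compares the result with pandas.

```python
import pandas as pd

def median_by_position(values):
    """Median from sorted positions (hypothetical helper for illustration)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value, at 1-based position (n + 1) / 2
        return ordered[mid]
    # Even count: average of the values at 1-based positions n/2 and n/2 + 1
    return (ordered[mid - 1] + ordered[mid]) / 2

odd = [7, 1, 5]       # sorted: 1, 5, 7
even = [7, 1, 5, 3]   # sorted: 1, 3, 5, 7

print(median_by_position(odd), pd.Series(odd).median())
print(median_by_position(even), pd.Series(even).median())
```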

**Find median of mileage**

In [None]:
cars.Mileage.median()

#### Mode

It is the value that appears most often in a set of values. A set can have more than one mode.
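
Because ties are possible, pandas returns the mode as a Series rather than a single value. A small sketch on made-up labels (not taken from the cars data):

```python
import pandas as pd

# Toy labels, invented for illustration: two values tie for most frequent
types = pd.Series(["Sedan", "Hatchback", "Sedan", "SUV", "Hatchback"])

counts = types.value_counts()   # frequency of each distinct value
modes = types.mode()            # every value that reaches the top frequency

print(counts)
print(modes)
```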

**Find the mode of `Type` of cars**

In [None]:
#Let's first find count of each of the car Types
#Hint: use value_counts

In [None]:
cars.Type.value_counts()

In [None]:
#Mode of cars

In [None]:
cars.Type

In [None]:
cars.Type.mode()

In [None]:
cars.head()

#### Variance

> Once, two statisticians of height 4 feet and 5 feet had to cross a river of AVERAGE depth 3 feet. Meanwhile, a third person came along and said, "What are you waiting for? You can easily cross the river."

It's the average of the squared distances of the data values from the *mean*. Note that pandas `var()` computes the sample variance by default, dividing by n - 1 rather than n.

<img style="float: left;" src="img/variance.png" height="320" width="320">

**Find variance of mileage**

In [None]:
cars.Mileage.var()

#### Standard Deviation

It is the square root of the variance, so it has the same units as the data and the mean.
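
A minimal sketch, on made-up numbers, of how variance and standard deviation are built from deviations around the mean. It also shows the `ddof` argument: pandas divides by n - 1 by default (`ddof=1`), while `ddof=0` divides by n.

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 6.0])   # toy numbers, mean is 4.0
deviations = values - values.mean()

# Sample variance: squared deviations summed, divided by n - 1
sample_var = (deviations ** 2).sum() / (len(values) - 1)

# Population variance: squared deviations averaged over n
population_var = (deviations ** 2).mean()

# Standard deviation is the square root of the variance
sample_std = np.sqrt(sample_var)

print(sample_var, values.var())                # pandas default, ddof=1
print(population_var, values.var(ddof=0))
print(sample_std, values.std())
```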

**Find standard deviation of mileage**

In [None]:
cars.Mileage.std()

#### Using Pandas built-in function

In [None]:
cars.describe()

#### Co-variance 

Covariance is a measure of how much two variables, say x and y, change together. Its sign describes the direction of their relationship: positive when the variables tend to move in the same direction, negative when they tend to move in opposite directions. Its magnitude depends on the units of both variables, so it is hard to interpret on its own. Compare this to variance, which measures how far a single variable spreads around its mean; the covariance of a variable with itself is its variance.
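
As a rough sketch on made-up numbers, the sample covariance can be computed from paired deviations and compared with pandas' `Series.cov`:

```python
import pandas as pd

# Toy paired observations, invented for illustration
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 5.0, 9.0])

# Sample covariance: products of paired deviations, summed, divided by n - 1
cov_by_hand = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

print(cov_by_hand, x.cov(y))
print(x.cov(x), x.var())   # covariance of a variable with itself is its variance
```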

<img style="float: left;" src="img/covariance.png" height="270" width="270">

<br>
<br>
<br>
<br>



#### Co-variance of mileage of Automatic and Manual Gear Type

In [None]:
pd.unique(cars.GearType)

In [None]:
cars_Automatic = cars[cars.GearType==' Automatic'].copy().reset_index()

In [None]:
cars_Manual = cars[cars.GearType==' Manual'].copy().reset_index()

In [None]:
cars_Automatic.head()

In [None]:
cars_Manual.head()

In [None]:
cars_Manual.shape

In [None]:
cars_Automatic.shape

The number of observations has to be the same. For the current exercise, let's take the first 300 observations in both datasets.

In [None]:
#.ix was removed in pandas 1.0; .iloc[:300] selects the first 300 rows by position
cars_Automatic = cars_Automatic.iloc[:300, :]
cars_Manual = cars_Manual.iloc[:300, :]

In [None]:
cars_Automatic.shape

In [None]:
cars_Manual.shape

In [None]:
cars_manual_automatic = pd.DataFrame([cars_Automatic.Mileage, cars_Manual.Mileage])

In [None]:
cars_manual_automatic

In [None]:
cars_manual_automatic = cars_manual_automatic.T

In [None]:
cars_manual_automatic.head()

In [None]:
cars_manual_automatic.columns = ['Mileage_Automatic', 'Mileage_Manual']

In [None]:
cars_manual_automatic.head()

In [None]:
#Co-variance matrix between the mileages of automatic and manual:
cars_manual_automatic.cov()

### Correlation

The extent to which two or more variables fluctuate together. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases. The correlation coefficient always lies between -1 and 1.
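
The Pearson correlation that pandas computes by default is the covariance divided by the product of the two standard deviations. A small sketch on made-up numbers, which also shows that rescaling a variable changes the covariance but not the correlation:

```python
import pandas as pd

# Toy paired observations, invented for illustration
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 5.0, 9.0])

# Pearson correlation: covariance scaled by both standard deviations
corr_by_hand = x.cov(y) / (x.std() * y.std())
print(corr_by_hand, x.corr(y))

# Changing the units of y multiplies the covariance by 100 ...
print(x.cov(y), x.cov(y * 100))
# ... but leaves the correlation unchanged
print(x.corr(y), x.corr(y * 100))
```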

<img style="float: left;" src="img/correlation.gif" height="270" width="270">

<br>
<br>
<br>



In [None]:
#Find the correlation between the mileages of automatic and manual in the above dataset

In [None]:
cars_manual_automatic.corr()

In [None]:
cars_manual_automatic.corrwith?


# Correlation != Causation

Correlation between two variables does not necessarily imply that one causes the other.
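
One way to see this is with simulated data in which a third variable drives both of the others. The sketch below is purely illustrative: the variable names (`temperature`, `ice_cream`, `sunburn`) and the way the data is generated are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical common cause, plus independent noise for each outcome
temperature = rng.normal(size=n)
df = pd.DataFrame({
    "temperature": temperature,
    "ice_cream": temperature + 0.5 * rng.normal(size=n),
    "sunburn": temperature + 0.5 * rng.normal(size=n),
})

# The two outcomes are strongly correlated although neither causes the other
raw_corr = df.ice_cream.corr(df.sunburn)

# Removing the common cause leaves only the independent noise terms.
# Plain subtraction works here only because the data was built that way.
adjusted_corr = (df.ice_cream - df.temperature).corr(df.sunburn - df.temperature)

print(raw_corr, adjusted_corr)
```

With real data the common cause is usually unknown or unmeasured, which is why a correlation alone cannot settle the question.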


<img style="float: left;" src="img/correlation_not_causation.gif" height="570" width="570">