# Seaborn
Matplotlib was not designed around Pandas DataFrames. To visualize data from a DataFrame with Matplotlib alone, you must extract each Series and often concatenate these Series into the right format.

Seaborn provides an API on top of Matplotlib that uses sane plot and color defaults, offers simple functions for common statistical plot types, and integrates with the functionality provided by Pandas DataFrames.

<img src="https://d39sghb3udgxv0.cloudfront.net/300812017/05/1494508552.jpg">

## Import packages

In [2]:
# Matplotlib for additional customization
from matplotlib import pyplot as plt

# Importing Pandas
import pandas as pd
import random

# Seaborn for plotting and styling
#!conda install seaborn
import seaborn as sns



## Color Palettes and Style

In [3]:
# Default settings
# sns.set()
sns.set(style="darkgrid")
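
The style call above can also set a color palette. The following is a minimal sketch, assuming the built-in "deep" palette and an arbitrary choice of four colors; neither choice comes from the original notebook.

```python
import seaborn as sns

# Apply the darkgrid style together with a named palette in one call.
# "deep" is one of seaborn's built-in palettes (chosen here only as an example).
sns.set(style="darkgrid", palette="deep")

# Ask for the first four colors of that palette and inspect them as hex strings.
palette = sns.color_palette("deep", 4)
hex_codes = palette.as_hex()
print(hex_codes)
```

Plots drawn after this call pick their colors from the active palette unless a color is passed explicitly.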

## Import Sample Data
**Car Sale Advertisements**
Data collected from private car sale advertisements in Ukraine.

This dataset contains data for more than 9.5K cars for sale in Ukraine. Most of them are used cars, which makes it possible to analyze features related to car operation. I treat this data as a sample of the whole Ukrainian car fleet.

**Content**

The dataset contains 9576 rows and 10 variables:

- car: manufacturer brand
- price: seller’s price in advertisement (in USD)
- body: car body type
- mileage: as mentioned in advertisement (in thousands of km)
- engV: rounded engine volume (in thousands of cubic cm, i.e. liters)
- engType: type of fuel (“Other” in this case should be treated as NA)
- registration: whether the car is registered in Ukraine or not
- year: year of production
- model: specific model name
- drive: drive type

The data has gaps, so be careful and check for NAs. I tried to check for and drop repeated offers, but duplicates are still theoretically possible.

In [None]:
car_dataset_url = 'https://raw.githubusercontent.com/ankitind/sample_datasets/master/car_ad.csv'
car_ads = pd.read_csv(car_dataset_url)

## Playing with the data
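Once the data is imported, explore it with Pandas DataFrame methods and attributes such as the following. Note that shape, index, and columns are attributes, so they take no parentheses.

- df.head()
- df.shape
- df.describe()
- df.index
- df.columns
- df.info()
- df.count()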

In [None]:
car_ads.head()

In [None]:
#Get number of rows and columns 
car_ads.shape

In [None]:
#Describe Index
car_ads.index

In [None]:
#Describe Columns
car_ads.columns

In [None]:
#Info on DataFrame
car_ads.info()

In [None]:
#Number of non-NA values
car_ads.count()

In [None]:
car_ads.describe(include = 'all')

In [None]:
#Drop missing observations
car_ads_no_missing = car_ads.dropna()

In [None]:
car_ads_no_missing.describe(include = 'all')

In [None]:
car_ads_no_missing.info()
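
Before dropping rows, it can help to see how many values are missing in each column, and to apply the note above that “Other” in engType should be treated as NA. A minimal sketch on a few made-up rows (the toy_ads frame and its values are hypothetical, not taken from the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic some of the columns described above (not real data).
toy_ads = pd.DataFrame({
    "car": ["Ford", "BMW", "Audi", "Kia"],
    "price": [15500.0, 20500.0, 35000.0, 17800.0],
    "engV": [2.5, np.nan, 3.0, 1.6],
    "engType": ["Gas", "Other", "Petrol", "Diesel"],
    "drive": ["full", "rear", None, "front"],
})

# Treat the "Other" fuel type as missing, as the dataset description suggests.
toy_ads["engType"] = toy_ads["engType"].replace("Other", np.nan)

# Count missing values per column before deciding what to drop.
missing_per_column = toy_ads.isna().sum()
print(missing_per_column)

# Keep only the rows in which every column is populated.
toy_ads_no_missing = toy_ads.dropna()
print(toy_ads_no_missing.shape)
```

The same three steps should carry over to car_ads, although dropping every row with any NA can discard a lot of data, so the per-column counts are worth checking first.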

## 1. Distribution

When dealing with a set of data, often the first thing you’ll want to do is get a sense for how the variables are distributed. Distribution plots are therefore commonly used at the initial stage of data exploration, when we are getting to know each variable. Variables are of two types: continuous and categorical. For a continuous variable, we look at the center, spread, and outliers. For a categorical variable, we look at the frequency table.

Examining car_ads_no_missing.describe(include = 'all'), we find that:

- the columns car, body, engType, registration, model, and drive are categorical variables;
- the columns price, mileage, engV, and year are continuous variables.
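
The same split can be read off the column dtypes. A minimal sketch, assuming a small made-up frame (toy_ads is hypothetical) in which numeric columns play the role of continuous variables and all other columns the role of categorical ones:

```python
import pandas as pd

# Hypothetical rows with two text columns and two numeric columns (not real data).
toy_ads = pd.DataFrame({
    "car": ["Ford", "BMW", "Kia"],
    "body": ["crossover", "sedan", "hatch"],
    "price": [15500.0, 20500.0, 17800.0],
    "year": [2010, 2011, 2013],
})

# Numeric dtypes are treated as continuous, everything else as categorical.
continuous_cols = toy_ads.select_dtypes(include="number").columns.tolist()
categorical_cols = toy_ads.select_dtypes(exclude="number").columns.tolist()

print(continuous_cols)
print(categorical_cols)
```

This is only a heuristic: a numeric column such as year can also be treated as categorical, depending on the question being asked.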


### a. Histogram
A histogram shows the distribution of a continuous variable.
One catch with histograms is the number of bins: too few bins hide the shape of the distribution, while too many make it look noisy.

In [4]:
# Work on a copy so that car_ads_no_missing keeps its original string labels
df = car_ads_no_missing.copy()

# Columns stored as strings (dtype 'object') hold the categorical variables
char_cols = df.dtypes.pipe(lambda x: x[x == 'object']).index
label_mapping = {}

# Replace each string column with integer codes and keep the code-to-label mapping
for c in char_cols:
    df[c], label_mapping[c] = pd.factorize(df[c])

df.info()

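
The cell above relies on pd.factorize, which returns a pair: an array of integer codes and the unique labels those codes refer to. A minimal sketch on made-up body types (the values are hypothetical):

```python
import pandas as pd

# Hypothetical body types (not taken from the real dataset).
body = pd.Series(["sedan", "crossover", "sedan", "van"])

# codes holds one integer per row; uniques holds the label behind each integer.
codes, uniques = pd.factorize(body)
print(codes)    # [0 1 0 2]
print(uniques)

# The saved labels turn the integer codes back into the original strings.
restored = uniques.take(codes)
print(list(restored))
```

This is why the loop stores the second return value in label_mapping: without it, the integer codes could not be translated back into labels.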

## Linear Model Plot in Seaborn
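General form: sns.lmplot(x='x', y='y', data=df, fit_reg=False)

Pass x and y as keyword arguments, since recent seaborn versions no longer accept them positionally. Setting fit_reg=False draws only the scatter points, without a fitted regression line.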


In [None]:
sns.lmplot(x='year', y='price', data = car_ads_no_missing, fit_reg=False)
plt.show()


In [None]:
# One panel per category using col="columnName" or row="columnName"
sns.lmplot(x='year', y='price', data = car_ads_no_missing, col='body', fit_reg=False)
plt.show()
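
As a rough illustration of what col does, the sketch below facets a tiny made-up frame (toy_ads and its values are hypothetical) and inspects the grid of axes that lmplot returns.

```python
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Hypothetical rows with two body types (not real data).
toy_ads = pd.DataFrame({
    "year": [2008, 2010, 2012, 2014, 2009, 2011, 2013, 2015],
    "price": [6000, 8000, 11000, 15000, 9000, 12000, 16000, 21000],
    "body": ["sedan"] * 4 + ["crossover"] * 4,
})

# lmplot returns a FacetGrid; col="body" creates one panel per body type.
grid = sns.lmplot(data=toy_ads, x="year", y="price", col="body", fit_reg=False)

# One row of panels, one column per distinct value of "body".
print(grid.axes.shape)
plt.show()
```

With the real data, the number of panels equals the number of distinct body types, so the figure can get wide; lmplot also accepts col_wrap to wrap the panels onto several rows.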

In [None]:
# One plot, with points colored by category using hue="columnName"
sns.lmplot(x='year', y='price', data = car_ads_no_missing, hue='body', fit_reg=False)
plt.show()

In [None]:
# Residuals of a linear regression of price on mileage
sns.residplot(x='mileage', y='price', data = car_ads_no_missing)
plt.show()

In [None]:
# Overlay a linear fit (red) and a second-order polynomial fit (green)
sns.regplot(x='year', y='price', data = car_ads_no_missing, color='red')

sns.regplot(x='year', y='price', data = car_ads_no_missing, order = 2,  color='green')
plt.show()
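
According to the seaborn documentation, regplot estimates the curve with numpy.polyfit when order is greater than 1. The sketch below shows what a second-order fit recovers, using made-up values that lie exactly on a quadratic; the numbers and the centering of the year values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Hypothetical years, centered so the polynomial fit is numerically well behaved.
year = np.array([2005, 2007, 2009, 2011, 2013, 2015], dtype=float)
x = year - year.mean()

# Made-up prices that follow an exact quadratic in the centered year.
price = 50.0 * x**2 + 900.0 * x + 9000.0

# Fit a straight line (order 1) and a quadratic (order 2) to the same points.
linear_coeffs = np.polyfit(x, price, deg=1)
quadratic_coeffs = np.polyfit(x, price, deg=2)

# Sum of squared residuals for each fit.
linear_sse = np.sum((np.polyval(linear_coeffs, x) - price) ** 2)
quadratic_sse = np.sum((np.polyval(quadratic_coeffs, x) - price) ** 2)

print(quadratic_coeffs)
print(linear_sse, quadratic_sse)
```

On curved data like this, the second-order fit leaves much smaller residuals than the straight line, which is the difference the red and green curves above make visible.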