# *HACKATHON DESCRIPTION*
---

A Nigerian automobile company, Great Motors, has just employed you as their lead data scientist for the analytics division.

Great Motors deals in used cars, with a huge market base in Nigeria. The company has a unique platform where customers can buy and sell cars. A seller posts details about the vehicle for review by the company’s mechanic on the platform to ascertain the vehicle's value. The company then lists the car for sale at the best price. Great Motors makes its profit by receiving a percentage of the selling price listed on the company platform. To ensure the car's selling price is the best for both the customer selling the vehicle and Great Motors, you have been assigned the task of coming up with a predictive model for determining the price of the car.

Your job is to predict the price at which the company should sell a car, based on the data the mechanics have submitted to you.

***Importing libraries needed***

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()  # set plots to seaborn style

## The Data

There are ~7000 entries in Train and ~2000 entries in Test.

The objective of the challenge is to predict the selling price, Amount (Million Naira), that the company should list a car at, using the available features (Location, Maker, Model, Year, Colour, Type, Distance).
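
The objective above amounts to a regression problem with one target column and seven candidate features. A minimal sketch of that split follows, using two illustrative rows laid out like Train.csv; the names `TARGET`, `X` and `y` are assumptions for illustration, and leaving `VehicleID` out of the features is a choice made here, not something the notebook states.

```python
import pandas as pd

TARGET = "Amount (Million Naira)"  # target column name used in the dataset

# Two illustrative rows with the Train.csv column layout.
sample = pd.DataFrame({
    "VehicleID": ["VHL18827", "VHL19499"],
    "Location": ["Ibadan", "Lagos"],
    "Maker": ["Hyundai", "Lexus"],
    "Model": ["Sonata", "RX 350"],
    "Year": [2012, 2010],
    "Colour": ["Silver", "Red"],
    TARGET: [3.5, 9.2],
    "Type": ["Nigerian Used", "Foreign Used"],
    "Distance": [125000.0, 110852.0],
})

# VehicleID is only an identifier, so it is excluded here (an assumption).
X = sample.drop(columns=[TARGET, "VehicleID"])
y = sample[TARGET]

print(list(X.columns))
print(y.tolist())
```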



In [3]:
!ls ../datasets

SampleSubmission.csv  Test.csv	Train.csv  VariableDefinitions.csv


In [6]:
train = pd.read_csv('../datasets/Train.csv')
test = pd.read_csv('../datasets/Test.csv')

train.head()

Unnamed: 0,VehicleID,Location,Maker,Model,Year,Colour,Amount (Million Naira),Type,Distance
0,VHL12546,Abuja,Honda,Accord Coupe EX V-6,2011,Silver,2.2,Nigerian Used,
1,VHL18827,Ibadan,Hyundai,Sonata,2012,Silver,3.5,Nigerian Used,125000.0
2,VHL19499,Lagos,Lexus,RX 350,2010,Red,9.2,Foreign Used,110852.0
3,VHL17991,Abuja,Mercedes-Benz,GLE-Class,2017,Blue,22.8,Foreign Used,30000.0
4,VHL12170,Ibadan,Toyota,Highlander,2002,Red,2.6,Nigerian Used,125206.0


**The dataframe below gives a description of the features in the dataset**

In [17]:
pd.set_option('display.max_colwidth', None)    # show full cell text instead of truncating it

variables = pd.read_csv('../datasets/VariableDefinitions.csv')
variables

Unnamed: 0,VehicleID,This is the unique identifier of the car.
0,Location,This is the location in Nigeria where the seller is based.
1,Maker,This is the manufacturer of the car. It is the brand name.
2,Model,This is the the name of the car product within a range of similar car products.
3,Year,This is the year the car was manufactured.
4,Colour,This is the colour of the car.
5,Amount (Million Naira),This is the selling price of the car. It is the amount the company will sell the car.
6,Type,"This is the nature of previous use of the car, whether it was previously used within Nigeria or outside Nigeria."
7,Distance,This is the mileage of the car. It is how much distance it covered in its previous use


In [18]:
dfs = [train, test]
for df in dfs:
    print(f'Shape of data is {df.shape}')

Shape of data is (7205, 9)
Shape of data is (2061, 8)


In [19]:
for df in dfs:
    print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7205 entries, 0 to 7204
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   VehicleID               7205 non-null   object 
 1   Location                7205 non-null   object 
 2   Maker                   7205 non-null   object 
 3   Model                   7205 non-null   object 
 4   Year                    7184 non-null   object 
 5   Colour                  7205 non-null   object 
 6   Amount (Million Naira)  7188 non-null   float64
 7   Type                    7008 non-null   object 
 8   Distance                4845 non-null   object 
dtypes: float64(1), object(8)
memory usage: 506.7+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2061 entries, 0 to 2060
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   VehicleID  2061 non-null   object 
 1   Location   2061 non-n

The train and test sets both contain null values in some of their features. Taking a closer look at them:

In [20]:
test.isna().sum()

VehicleID      0
Location       0
Maker          0
Model          0
Year           2
Colour         0
Type          54
Distance     676
dtype: int64

In [21]:
train.isna().sum()

VehicleID                    0
Location                     0
Maker                        0
Model                        0
Year                        21
Colour                       0
Amount (Million Naira)      17
Type                       197
Distance                  2360
dtype: int64
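
Raw counts are easier to compare across the two sets as percentages: from the figures above, Distance is missing in roughly 32.8% of rows in both train (2360 of 7205) and test (676 of 2061). A minimal sketch of that calculation, where `null_share` is a hypothetical helper and `demo` is a made-up stand-in for `train` or `test`:

```python
import pandas as pd

def null_share(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).round(1).sort_values(ascending=False)

# Tiny stand-in frame; in the notebook this would be `train` or `test`.
demo = pd.DataFrame({
    "Year": [2011, None, 2010, 2017],
    "Distance": [None, 125000.0, None, 30000.0],
    "Colour": ["Silver", "Silver", "Red", "Blue"],
})

shares = null_share(demo)
print(shares)
```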

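**Dropping columns with null values**

Null values are concentrated in the Distance column (2360 of 7205 rows in train, 676 of 2061 in test), and replacing them with the mean could bias the model. Instead, every column that contains a null is dropped. Note that in train this also removes the target column, Amount (Million Naira), along with Year, Type and Distance.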

In [24]:
for df in dfs:
    # axis=1 drops every column that contains at least one null (columns, not rows)
    df.dropna(axis=1, inplace=True)

In [26]:
for df in dfs:
    print(df.isna().sum())

VehicleID    0
Location     0
Maker        0
Model        0
Colour       0
dtype: int64
VehicleID    0
Location     0
Maker        0
Model        0
Colour       0
dtype: int64
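
The output above shows that only five columns survive in each frame, so the target, Amount (Million Naira), is no longer in `train`. As a hedged alternative, and not what this notebook does, one could drop only the rows that lack a target and keep the feature columns, marking missing Distance values with an indicator before filling them with the median. The sketch below uses made-up rows; `Distance_missing` is a hypothetical column name, and it assumes Distance has already been converted to a numeric dtype.

```python
import pandas as pd

TARGET = "Amount (Million Naira)"

# Small stand-in for `train`; the values are illustrative only.
toy = pd.DataFrame({
    "Maker": ["Honda", "Hyundai", "Lexus", "Toyota"],
    TARGET: [2.2, 3.5, None, 2.6],
    "Distance": [None, 125000.0, 110852.0, 125206.0],
})

# Drop only the rows without a target; all feature columns are kept.
cleaned = toy.dropna(subset=[TARGET]).copy()

# Record where Distance was missing, then fill the gaps with the median.
cleaned["Distance_missing"] = cleaned["Distance"].isna()  # hypothetical column name
cleaned["Distance"] = cleaned["Distance"].fillna(cleaned["Distance"].median())

print(cleaned)
```

The indicator column lets a model distinguish observed mileage from filled-in mileage, which is one way to limit the bias that plain mean imputation could introduce.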


## ***EXPLORATION***

Now that the cleaning has been taken care of, we'll take a deeper dive into the data, looking column by column at how each feature relates to the target variable and whether any insights emerge that could be useful in the modelling.
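
One simple way to relate a categorical column to the target is a grouped summary. The sketch below runs on made-up rows and assumes the target column, Amount (Million Naira), is available alongside the feature; the name `summary` is illustrative only.

```python
import pandas as pd

TARGET = "Amount (Million Naira)"

# Made-up rows standing in for the training data.
toy = pd.DataFrame({
    "Maker": ["Toyota", "Toyota", "Lexus", "Lexus", "Honda"],
    TARGET: [2.6, 3.4, 9.2, 10.8, 2.2],
})

# Listing count and median price per maker, most expensive first.
summary = (
    toy.groupby("Maker")[TARGET]
    .agg(["count", "median"])
    .sort_values("median", ascending=False)
)
print(summary)
```

The same pattern can be repeated for Location or Colour by changing the grouping column.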