# Vehicle Dataset

### In this project, I analysed a dataset about vehicles and their characteristics.

The goal of this project was to preprocess and analyse the data using the pandas library.
Throughout this assignment I learnt and recapped the following:
- Finding the maximum and mean of a column
- Counting the rows that remain after a condition has been applied to a column
- Using `dropna()` to drop rows with missing values, taking care to work on a copy of the DataFrame (rather than using `inplace=True`) so that the original is not overwritten
- Using the `.mode()` method on a column, acknowledging that there may be more than one mode, and using the `[0]` index to select the first item in the returned Series
- Creating a new column by string concatenation, combining existing column values with hardcoded text after converting the numeric column to a string
- Creating a new DataFrame with a new index (here, a column combining the vehicle's name and year), then writing a function that takes the vehicle's name and year as an argument and returns the acceleration of that vehicle

In [1]:
import pandas as pd

First, we will load the dataset from `data/cars.csv` into a DataFrame.

In [2]:
df = pd.read_csv('data/cars.csv')
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino


## Dataset stats

#### 1. What's the mean of the values in the `weight` column?

Store the answer in a variable called `mean_weight`

In [20]:
mean_weight = df['weight'].mean()
mean_weight

2973.946835443038

#### 2. What's the maximum value in the `horsepower` column?

Store the answer in a variable called `max_horsepower`

In [22]:

max_horsepower = df['horsepower'].max()
max_horsepower

230.0

#### 3. How many cars have a `weight` greater than or equal to 3500?

Store the answer in a variable called `heavy_cars`

In [23]:
# Use >= so that cars weighing exactly 3500 are counted too
heavy_cars = len(df[df['weight'] >= 3500])
heavy_cars

109
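
As a side note, rows that satisfy a condition can also be counted by summing the boolean mask directly. The sketch below uses a small made-up DataFrame (`toy` is hypothetical, not part of `cars.csv`) and shows how `>` and `>=` differ when a value sits exactly on the threshold.

```python
import pandas as pd

# Hypothetical toy data, not taken from cars.csv
toy = pd.DataFrame({'weight': [3400, 3500, 3600, 2000]})

# A comparison yields a boolean Series; summing it counts the True values
strictly_heavier = (toy['weight'] > 3500).sum()   # 1 (only 3600)
at_least_3500 = (toy['weight'] >= 3500).sum()     # 2 (3500 and 3600)
```

Both approaches (`len` of the filtered DataFrame and `.sum()` on the mask) should give the same count for the same condition.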

#### 4. Create a new DataFrame with an additional column called `ratio`, which equals `horsepower` divided by `weight`

Call the new DataFrame `df_ratio`

In [6]:

df_ratio = df.copy()

df_ratio['ratio'] = (df_ratio['horsepower'] / df_ratio['weight'])
df_ratio.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name,ratio
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu,0.0371
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320,0.044679
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite,0.043655
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst,0.043694
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino,0.040591
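
An alternative to copying and then adding the column is `DataFrame.assign`, which returns a new DataFrame and leaves the original untouched. This is only a sketch on made-up values (`toy` and `toy_ratio` are hypothetical names).

```python
import pandas as pd

# Hypothetical toy data with the same two columns used for the ratio
toy = pd.DataFrame({'horsepower': [130.0, 165.0], 'weight': [3504, 3693]})

# assign() builds a new DataFrame that includes the extra column
toy_ratio = toy.assign(ratio=toy['horsepower'] / toy['weight'])

# The original DataFrame does not gain the new column
has_ratio_in_original = 'ratio' in toy.columns  # False
```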


## Dataset sorting and filtering

#### 5. Create a new DataFrame containing only cars with an `origin` of 'usa'

We'll start with a copy of the original DataFrame to avoid modifying the original. Call the new DataFrame `df_usa`

In [25]:
df_usa = df.copy()

df_usa = df_usa[df_usa['origin'] == 'usa']
df_usa.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino


#### 6. What's the mean `mpg` of cars of origin `usa`?

Remember that we can use the `df_usa` DataFrame just created, which only contains these cars.

Store your answer in a variable called `mean_mpg_usa`

In [26]:

mean_mpg_usa = df_usa['mpg'].mean()
mean_mpg_usa

20.04308943089431

#### 7. How many cars of origin `usa` have 8 `cylinders` ?

Store your answer in a variable called `eight_cyl_usa`

In [27]:
eight_cyl_usa = len(df_usa[df_usa['cylinders'] == 8])
eight_cyl_usa

103

We can see from `df.info()` that we have some missing values in the `horsepower` column.

#### 8. Create a new DataFrame (from the original `df`) which does not contain the rows with a missing value

Call the new DataFrame `df_horsepower`

In [10]:
df_horsepower = df.copy()


df_horsepower = df_horsepower.dropna()
df_horsepower.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 390 entries, 0 to 394
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           390 non-null    float64
 1   cylinders     390 non-null    int64  
 2   displacement  390 non-null    float64
 3   horsepower    390 non-null    float64
 4   weight        390 non-null    int64  
 5   acceleration  390 non-null    float64
 6   model_year    390 non-null    int64  
 7   origin        390 non-null    object 
 8   name          390 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 30.5+ KB
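
Note that `dropna()` with no arguments removes a row if any column is missing. If only one column matters, the `subset` parameter restricts the check. The sketch below illustrates the difference on made-up data (`toy` is hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two different columns
toy = pd.DataFrame({
    'horsepower': [130.0, np.nan, 150.0],
    'mpg': [18.0, 15.0, np.nan],
})

# Drops every row that has a missing value in any column
any_missing_dropped = toy.dropna()

# Drops only the rows where horsepower is missing
hp_missing_dropped = toy.dropna(subset=['horsepower'])

# Neither call modifies toy, because inplace=True was not used
```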


#### 9. What's the first (or only) mode value for `horsepower` in `df_horsepower`?

Store your answer in a variable called `mode_hp`

*Hint: this is the value found using the `.mode()` method on the given column. Because there may be more than one mode, the method returns a Series rather than a single value. We can access the first value using `[0]`, like we would with a list.*

In [11]:

mode_hp = df_horsepower['horsepower'].mode()[0]
mode_hp

150.0
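
To illustrate why `[0]` is needed, here is a small sketch with a made-up Series (`toy` is hypothetical) in which two values tie for the most frequent. `.mode()` returns both, sorted in ascending order, so `[0]` picks the smallest of the tied values.

```python
import pandas as pd

# Hypothetical values: 90.0 and 150.0 each appear twice
toy = pd.Series([150.0, 90.0, 150.0, 90.0, 110.0])

modes = toy.mode()      # Series holding 90.0 and 150.0
first_mode = modes[0]   # 90.0, the first (smallest) mode
```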

#### 10. Create a DataFrame containing only cars with a horsepower greater than or equal to `mode_hp` in `df_horsepower`

Call the new DataFrame `df_high_hp`

In [12]:
# Filter the rows, then copy so df_high_hp is independent of df_horsepower
df_high_hp = df_horsepower[df_horsepower['horsepower'] >= mode_hp].copy()
df_high_hp

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
5,15.0,8,429.0,198.0,4341,10.0,70,usa,ford galaxie 500
6,14.0,8,454.0,220.0,4354,9.0,70,usa,chevrolet impala
...,...,...,...,...,...,...,...,...,...
228,15.5,8,350.0,170.0,4165,11.4,77,usa,chevrolet monte carlo landau
229,15.5,8,400.0,190.0,4325,12.2,77,usa,chrysler cordoba
261,17.7,6,231.0,165.0,3445,13.4,78,usa,buick regal sport coupe (turbo)
287,16.9,8,350.0,155.0,4360,14.9,79,usa,buick estate wagon (sw)


#### 11. What percentage of the cars in `df_high_hp` have 8 `cylinders`?

Store your answer in a variable called `percentage_eight_cyl`

Your answer should be a float, and should be for example 56.0 rather than 0.56 for 56%.

In [13]:

percentage_eight_cyl = len(df_high_hp[df_high_hp['cylinders'] == 8]) / len(df_high_hp) * 100
percentage_eight_cyl

98.50746268656717
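
A percentage like this can also be computed from the mean of a boolean mask, since `True` counts as 1 and `False` as 0. This is a sketch on made-up data (`toy` is hypothetical), not the notebook's DataFrame.

```python
import pandas as pd

# Hypothetical toy data: three of the four cars have 8 cylinders
toy = pd.DataFrame({'cylinders': [8, 8, 6, 8]})

# The mean of the mask is the fraction of True values; scale it to a percentage
pct_eight_cyl = (toy['cylinders'] == 8).mean() * 100  # 75.0
```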

## Dataset manipulation

We can see from the output below that some car names have more than one entry in the DataFrame:

In [14]:
df['name'].value_counts()

toyota corolla         5
amc matador            5
ford maverick          5
toyota corona          4
chevrolet chevette     4
                      ..
chevrolet monza 2+2    1
ford mustang ii        1
pontiac astro          1
amc pacer              1
chevy s-10             1
Name: name, Length: 306, dtype: int64

#### 12. Add a column called `name_year` to a copy of `df`, with each entry containing a string in the following format:

    name + ' - 19' + model_year

So for example, `'chevrolet chevelle malibu - 1970'`

Call the new DataFrame `df_name`

*Hint: you may find the .astype() method useful*

In [28]:
df_name = df.copy()


df_name['name_year'] = df_name['name'] + ' - 19' + df_name['model_year'].astype(str)
df_name.tail()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name,name_year
390,27.0,4,140.0,86.0,2790,15.6,82,usa,ford mustang gl,ford mustang gl - 1982
391,44.0,4,97.0,52.0,2130,24.6,82,europe,vw pickup,vw pickup - 1982
392,32.0,4,135.0,84.0,2295,11.6,82,usa,dodge rampage,dodge rampage - 1982
393,28.0,4,120.0,79.0,2625,18.6,82,usa,ford ranger,ford ranger - 1982
394,31.0,4,119.0,82.0,2720,19.4,82,usa,chevy s-10,chevy s-10 - 1982


Looking at `value_counts()` on the `name_year` column, we should now see that there are no duplicated entries:

In [16]:
df_name['name_year'].value_counts()

chevrolet chevelle malibu - 1970    1
datsun 200-sx - 1978                1
plymouth sapporo - 1978             1
toyota celica gt liftback - 1978    1
dodge omni - 1978                   1
                                   ..
ford pinto - 1974                   1
datsun b210 - 1974                  1
chevrolet nova - 1974               1
amc hornet - 1974                   1
chevy s-10 - 1982                   1
Name: name_year, Length: 395, dtype: int64
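
Instead of reading through the counts, uniqueness can be checked directly with the `is_unique` attribute of a Series. The sketch below rebuilds the same kind of `name_year` column on made-up rows (`toy` is hypothetical).

```python
import pandas as pd

# Hypothetical toy data: the same name appears in two different years
toy = pd.DataFrame({
    'name': ['ford pinto', 'ford pinto', 'amc hornet'],
    'model_year': [71, 74, 74],
})

# model_year is an integer column, so convert it to str before concatenating
toy['name_year'] = toy['name'] + ' - 19' + toy['model_year'].astype(str)

names_unique = toy['name'].is_unique            # False
name_years_unique = toy['name_year'].is_unique  # True
```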

#### 13. On a copy of the `df_name` DataFrame, set the index of the DataFrame as the `name_year` column

Call your new DataFrame `df_car_index`

*Hint: if using the set_index method, either use `inplace=True` or assign the result to a variable, otherwise the new index won't be stored.*

In [30]:
df_car_index = df_name.copy()
df_car_index.set_index('name_year', inplace=True)
df_car_index.sample(5)

name_year,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
volkswagen 1131 deluxe sedan - 1970,26.0,4,97.0,46.0,1835,20.5,70,europe,volkswagen 1131 deluxe sedan
ford maverick - 1970,21.0,6,200.0,85.0,2587,16.0,70,usa,ford maverick
plymouth horizon - 1979,34.2,4,105.0,70.0,2200,13.2,79,usa,plymouth horizon
datsun 710 - 1974,32.0,4,83.0,61.0,2003,19.0,74,japan,datsun 710
capri ii - 1976,25.0,4,140.0,92.0,2572,14.9,76,usa,capri ii
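
The hint about `set_index` can be seen in a small sketch: calling the method without `inplace=True` and without assigning the result leaves the DataFrame unchanged. The data here is made up (`toy` and `indexed` are hypothetical names).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'name_year': ['ford torino - 1970', 'ford maverick - 1970'],
    'acceleration': [10.5, 16.0],
})

# The result is discarded here, so toy keeps its default integer index
toy.set_index('name_year')
still_a_column = 'name_year' in toy.columns  # True

# Assigning the result stores the re-indexed DataFrame
indexed = toy.set_index('name_year')
```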


#### 14. Create a function which takes `name_year` as the only parameter, and returns the `acceleration` for any car in `df_car_index`



In [18]:
def acceleration(name_year):
    """Return the acceleration of the car labelled name_year in df_car_index."""
    return df_car_index.loc[name_year, 'acceleration']

You can test your function using the following cell:

In [19]:
acceleration('ford torino - 1970')

10.5
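
A lookup with `.loc` raises a `KeyError` when the label is not in the index. One possible way to handle that, sketched here on a made-up index (`toy_index` and `acceleration_or_none` are hypothetical and not part of the assignment), is to check membership first and return `None` for unknown cars.

```python
import pandas as pd

# Hypothetical toy DataFrame indexed by name_year
toy_index = pd.DataFrame(
    {'acceleration': [10.5, 16.0]},
    index=pd.Index(['ford torino - 1970', 'ford maverick - 1970'], name='name_year'),
)

def acceleration_or_none(name_year, frame=toy_index):
    """Return the acceleration for name_year, or None if the label is absent."""
    if name_year not in frame.index:
        return None
    return frame.loc[name_year, 'acceleration']

known = acceleration_or_none('ford torino - 1970')   # 10.5
unknown = acceleration_or_none('ford torino - 1999')  # None
```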