## Summary Statistics of Automobile Evaluation Data

In this project, we analyze a sample from a popular open-source dataset. The dataset contains information on the cost and physical attributes of several thousand cars. It was originally used to train a classification model that assigned an acceptability category to each car based on these attributes.

The car evaluation dataset is sourced from the UCI Machine Learning Repository and has been slightly modified for this project. Specifically, one additional field, `manufacturer_country`, has been simulated for illustrative purposes. More information about this dataset can be found at https://archive.ics.uci.edu/ml/datasets/car+evaluation.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# let's load the dataset and investigate.

car_eval = pd.read_csv(r"C:\Users\amanp\OneDrive\Desktop\car_eval.csv")

In [3]:
car_eval.head()

Unnamed: 0,buying_cost,maintenance_cost,doors,capacity,luggage,safety,acceptability,manufacturer_country
0,vhigh,low,4,4,small,med,unacc,China
1,vhigh,med,3,4,small,high,acc,France
2,med,high,3,2,med,high,unacc,United States
3,low,med,4,more,big,low,unacc,United States
4,low,high,2,more,med,high,acc,South Korea
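
The `Unnamed: 0` column in the preview is typically what a saved row index looks like when a CSV is read back with default settings. The sketch below shows one way to avoid it with `index_col=0`. It uses a small hypothetical in-memory sample rather than the project file, so the column contents are assumptions made only for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row sample shaped like the file above; the unnamed first
# column is a row index that was written out along with the data.
csv_text = """,buying_cost,maintenance_cost,doors
0,vhigh,low,4
1,vhigh,med,3
2,med,high,3
"""

# Default read: the saved index comes back as an extra "Unnamed: 0" column.
default_read = pd.read_csv(io.StringIO(csv_text))
print(default_read.columns.tolist())

# index_col=0 tells pandas to use the first column as the row index instead.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed_read.columns.tolist())
```

Whether to drop the column or keep it is a matter of preference; none of the summaries in this project depend on it.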


### Summarizing Manufacturing Country


`manufacturer_country` is a nominal categorical variable that indicates the country of the manufacturer of each car reviewed.
Let's create a table of frequencies of all the cars reviewed by `manufacturer_country`.

In [4]:
# `.value_counts()` produces a table of frequencies sorted from most to least common,
# so the N-th row of this table gives the N-th most common value in the data.

car_eval.manufacturer_country.value_counts()

Japan            228
Germany          218
South Korea      159
United States    138
Italy             97
France            87
China             73
Name: manufacturer_country, dtype: int64
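
Because the frequency table is sorted from most to least common, the N-th most common value can be looked up by position. The following is a minimal sketch on a hypothetical toy column (not the project data); the names `countries`, `n`, `nth_value`, and `nth_count` are introduced here only for illustration.

```python
import pandas as pd

# Hypothetical toy column, not the project data.
countries = pd.Series(
    ["Japan", "Germany", "Japan", "Italy", "Germany", "Japan"],
    name="manufacturer_country",
)

counts = countries.value_counts()  # sorted from most to least frequent

n = 2  # look up the 2nd most common value
nth_value = counts.index[n - 1]  # the label in the N-th row
nth_count = counts.iloc[n - 1]   # the frequency in the N-th row
print(nth_value, nth_count)
```

Note that when two values are tied, their relative order in the table should not be relied on.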

Let's calculate a table of proportions for countries that appear in "manufacturer_country" in the dataset.

In [5]:
# Passing `normalize=True` to `.value_counts()` converts the default table of frequencies
# into a table of proportions. A table of frequencies gives the count of observations for
# each value; a table of proportions gives the share of the total that each value represents.

car_eval.manufacturer_country.value_counts(normalize=True)

Japan            0.228
Germany          0.218
South Korea      0.159
United States    0.138
Italy            0.097
France           0.087
China            0.073
Name: manufacturer_country, dtype: float64

### Summarizing Buying Costs

"buying_cost" is a categorical variable which describes the cost of buying any car in the dataset. Let's print out a list of the possible values for this variable.

In [6]:
# The `.unique()` method returns the distinct values that appear in a column.

car_eval["buying_cost"].unique()

array(['vhigh', 'med', 'low', 'high'], dtype=object)

`buying_cost` is an ordinal categorical variable, which means its values have a natural order that we can use for additional operations, such as finding a median.
Let's create a list of the unique categories in the `buying_cost` variable, ordered from lowest to highest cost.

In [7]:
buying_cost_categories = ['low', 'med', 'high', 'vhigh']
print(buying_cost_categories)

['low', 'med', 'high', 'vhigh']


Now, let's convert `buying_cost` to type `'category'` using the order created above.

In [8]:
# We can convert a field to type category using `pandas.Categorical()`. With `ordered=True`, the
# categories keep the order we supply, which lets us compare values and work with their integer codes.

# The output of `car_eval.buying_cost` below confirms that the column now has dtype category.

car_eval["buying_cost"] = pd.Categorical(
    car_eval["buying_cost"],
    categories=buying_cost_categories,
    ordered=True
)

car_eval.buying_cost

0      vhigh
1      vhigh
2        med
3        low
4        low
       ...  
995      low
996      low
997    vhigh
998      low
999      low
Name: buying_cost, Length: 1000, dtype: category
Categories (4, object): ['low' < 'med' < 'high' < 'vhigh']
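
To illustrate what the ordering makes possible, here is a small sketch on a hypothetical toy column that uses the same four levels. The names `levels`, `cost`, and `above_med` are assumptions made for this example and are not part of the project code.

```python
import pandas as pd

# Hypothetical toy values using the same four levels as buying_cost.
levels = ["low", "med", "high", "vhigh"]
cost = pd.Series(
    pd.Categorical(
        ["vhigh", "med", "low", "high", "low"],
        categories=levels,
        ordered=True,
    )
)

# Comparisons respect the declared order, not alphabetical order.
above_med = cost > "med"
print(above_med.tolist())

# min() and max() are defined for ordered categoricals.
print(cost.min(), cost.max())

# .cat.codes exposes each value's integer position in the category list.
print(cost.cat.codes.tolist())
```

Without `ordered=True`, the comparison and the `min()`/`max()` calls would raise an error, since an unordered categorical has no notion of "greater than".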

Finally, let's calculate the median category of the `buying_cost` variable.

In [9]:
# In Python, we can use `np.median()` to calculate the median value of a numerical series. In this case, we also must access 
# the numerical values of the categories. This can be done with the `.cat.codes` attribute. 


median_category_num = np.median(car_eval['buying_cost'].cat.codes)
print(median_category_num) 

median_category = buying_cost_categories[int(median_category_num)]
print(median_category)

1.0
med
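
The approach above works cleanly here because the column has no missing values and the median lands on a whole number. As a hedged sketch of two edge cases, the example below uses a hypothetical toy column: missing values are coded as -1 in `.cat.codes` and would distort the median, and an even number of rows can produce a median that falls between two levels. Flooring the result is one arbitrary convention chosen for this sketch, not the only reasonable one.

```python
import numpy as np
import pandas as pd

levels = ["low", "med", "high", "vhigh"]

# Hypothetical toy column with two missing values (None).
cost = pd.Series(
    pd.Categorical(
        ["low", "high", "vhigh", "vhigh", None, None],
        categories=levels,
        ordered=True,
    )
)

codes = cost.cat.codes.to_numpy()

# Naive median: the -1 codes for missing values pull the result down.
naive_median = np.median(codes)
print(naive_median)

# Drop the missing-value codes before taking the median.
valid_codes = codes[codes >= 0]
median_code = np.median(valid_codes)
print(median_code)

# With an even number of valid rows the median can sit between two levels.
# Flooring reports the lower of the two (an assumption of this sketch).
median_level = levels[int(np.floor(median_code))]
print(median_level)
```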


### Summarizing Luggage Capacity


`luggage` is a categorical variable in the car evaluations dataset that records the luggage capacity for each reviewed car.
Let's calculate a table of proportions for `luggage`.

In [10]:
car_eval.luggage.value_counts(normalize=True)

small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64

Without passing `normalize = True` to `.value_counts()`, let's replicate the result we got above.

In [11]:
# Dividing by `len()` relies on `luggage` having no null values, because `len()` counts every row.
# If a field does have nulls, the more robust solution is to divide by `.count()`, which excludes
# nulls from the denominator just as `.value_counts(normalize=True)` does.

print(car_eval.luggage.value_counts()/len(car_eval.luggage))

# Safe alternative if there are nulls:

print(car_eval.luggage.value_counts()/car_eval.luggage.count())

small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64
small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64
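
The two denominators give identical results above because `luggage` has no nulls. To show where they diverge, here is a minimal sketch on a hypothetical toy column with one missing value; the variable names are introduced only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
luggage = pd.Series(["small", "med", "big", "small", np.nan], name="luggage")

counts = luggage.value_counts()  # NaN is left out of the counts by default

by_len = counts / len(luggage)       # denominator includes the NaN row
by_count = counts / luggage.count()  # denominator is non-null rows only
builtin = luggage.value_counts(normalize=True)

print(by_len.sum())    # falls short of 1 because of the missing row
print(by_count.sum())  # proportions over non-null rows add up to 1
```

In this sketch, `by_count` matches the built-in `normalize=True` result, while `by_len` understates every proportion.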


### Summarizing Number of Doors

`doors` is a categorical variable in the car evaluations dataset that records the count of doors for each reviewed car.
Let's find the frequency and proportion of cars that have 5 or more doors.

In [12]:
# We must first create a series of True/False values.
# Calling `.sum()` on this series counts the `True` (1) values, and calling `.mean()`
# gives the proportion of values that are `True`.


frequency = (car_eval.doors == '5more').sum()
proportion = (car_eval.doors == '5more').mean()
print(frequency)
print(proportion)

246
0.246
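
The reason `.sum()` and `.mean()` work on a comparison is that `True` is treated as 1 and `False` as 0. A minimal sketch on a hypothetical toy column (the values below are made up for illustration) makes the arithmetic easy to check by hand:

```python
import pandas as pd

# Hypothetical toy column; door counts are stored as strings such as '5more'.
doors = pd.Series(["2", "3", "5more", "4", "5more", "2", "3", "5more"])

has_5more = doors == "5more"  # boolean Series, one True/False per row

frequency = has_5more.sum()    # True counts as 1, so the sum is a count
proportion = has_5more.mean()  # the mean of 0/1 values is the share of True
print(frequency, proportion)
```

Storing the comparison in a variable such as `has_5more` avoids evaluating it twice and keeps the count and the proportion tied to the same condition.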


### Conclusion:
In this project, we summarized the following features from the Automobile data:
1. Table of proportions for the country of the manufacturer of each car.
2. Median category of the `buying_cost` variable, which came out to be 'med'.
3. Table of proportions for the luggage capacity categories small, med, and big, which came out to be 33.9%, 33.3%, and 32.8% respectively.
4. Proportion of cars that have 5 or more doors, which came out to be 24.6%.