## SUMMARY STATISTICS

Summarizing Automobile Evaluation Data


In the following project you’ll use what you’ve learned about summarizing categorical data to analyze a sample from a popular open source dataset. This dataset contains information on the cost and physical attributes of several thousand cars. Originally, this dataset was used to train a classification model that assigned an acceptability category to each car based on these attributes.

The car evaluation dataset has been sourced from the UCI Machine Learning Repository and has been slightly modified for this project. Specifically, one additional field, manufacturer_country, has been simulated for illustrative purposes. You can read more about the details, features, and original uses of this dataset on the UCI data description page.

In [2]:
import pandas as pd
import numpy as np


In [5]:
car_eval = pd.read_csv(r'E:\Codecademy\Summarizing_Auto_Eval_Data\car_eval.csv')
print(car_eval.head())

  buying_cost maintenance_cost doors capacity luggage safety acceptability  \
0       vhigh              low     4        4   small    med         unacc   
1       vhigh              med     3        4   small   high           acc   
2         med             high     3        2     med   high         unacc   
3         low              med     4     more     big    low         unacc   
4         low             high     2     more     med   high           acc   

  manufacturer_country  
0                China  
1               France  
2        United States  
3        United States  
4          South Korea  


**Finding the manufacturer frequency and the fourth most frequent manufacturer:**

In [11]:
mfc = car_eval['manufacturer_country'].value_counts()
print("The frequency of manufacturer countries is as follows:")
mfc


The frequency of manufacturer countries is as follows:


Japan            228
Germany          218
South Korea      159
United States    138
Italy             97
France            87
China             73
Name: manufacturer_country, dtype: int64

In [10]:
print(f"The fourth most frequent manufacturer is the {mfc.index[3]}")

The fourth most frequent manufacturer is the United States
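The ranking logic above relies on `value_counts()` returning counts sorted from most to least frequent, so position `n - 1` of its index holds the n-th most frequent value. A minimal sketch on toy data (the `countries` Series is hypothetical and is not drawn from car_eval.csv):

```python
import pandas as pd

# Toy data (hypothetical): each country has a distinct count, so the ranking is unambiguous
countries = pd.Series(
    ["Japan"] * 4 + ["Germany"] * 3 + ["Italy"] * 2 + ["China"] * 1
)

counts = countries.value_counts()      # sorted from most to least frequent
most_frequent = counts.index[0]        # the mode
third_most_frequent = counts.index[2]  # position n - 1 gives the n-th most frequent

print(most_frequent, third_most_frequent)
```

When two values have the same count, their relative order in the result should not be relied on.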


**Calculating a table of proportions for countries that appear in the manufacturer_country column**

In [19]:
print("The proportion of manufacturer countries is as follows:")
mfc_prop = car_eval['manufacturer_country'].value_counts(normalize=True).reset_index()
mfc_prop.columns =['Country', 'Prop']
mfc_prop

The proportion of manufacturer countries is as follows:


         Country   Prop
0          Japan  0.228
1        Germany  0.218
2    South Korea  0.159
3  United States  0.138
4          Italy  0.097
5         France  0.087
6          China  0.073


**Finding the percentage of cars manufactured in Japan**

In [20]:
japan_prop = mfc_prop.loc[mfc_prop['Country'] == 'Japan', 'Prop'].iloc[0]
print(f"The percentage of cars that were manufactured in Japan was {japan_prop:.1%}")

The percentage of cars that were manufactured in Japan was 22.8%
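A single proportion can also be read straight from the Series that `value_counts(normalize=True)` returns, since its index holds the category labels. The sketch below uses a small hypothetical `countries` Series rather than the project data, and formats the result as a percentage:

```python
import pandas as pd

# Toy data (hypothetical): 2 of 4 cars are from Japan
countries = pd.Series(["Japan", "Japan", "Germany", "Italy"])

props = countries.value_counts(normalize=True)  # index = country, values = proportions
japan_share = props["Japan"]                    # label-based lookup, returns a scalar
message = f"{japan_share:.1%}"                  # the % format multiplies by 100

print(message)
```

Indexing by label avoids the intermediate `reset_index()` step when only one value is needed.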


**Printing a list of possible values of the buying_cost variable**

In [22]:
print(car_eval.buying_cost.unique())
print("The above indicates that buying_cost is an ordinal categorical variable, which means that we can assign an order to its values and compute order-based statistics such as the median.")

['vhigh' 'med' 'low' 'high']
The above indicates that buying_cost is an ordinal categorical variable, which means that we can assign an order to its values and compute order-based statistics such as the median.


**Creating a list that contains the unique values in buying_cost ordered lowest to highest**

In [24]:
buying_cost_categories = ['low', 'med', 'high', 'vhigh']

**Converting buying_cost to type 'category'**

In [26]:
car_eval['buying_cost'] = pd.Categorical(car_eval['buying_cost'],
                                         categories=buying_cost_categories,
                                         ordered=True)

**Calculating the median of the buying_cost variable**

In [32]:
# Map each category to its integer code (low=0, med=1, high=2, vhigh=3) and take the median code
median_buying_cost_num = np.median(car_eval['buying_cost'].cat.codes)
# int() truncates a half-way median (e.g. 1.5) down to the lower category
median_buying_cost_cat = buying_cost_categories[int(median_buying_cost_num)]
print(f'The median buying cost category is: {median_buying_cost_cat}')

The median buying cost category is: med
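To make the code-based median easier to follow, here is a small self-contained sketch with five hypothetical observations (not taken from car_eval.csv). It shows how an ordered categorical maps labels to integer codes and how the median code is translated back into a label:

```python
import numpy as np
import pandas as pd

levels = ["low", "med", "high", "vhigh"]  # lowest to highest

# Toy data (hypothetical): five observations of an ordinal variable
costs = pd.Series(
    pd.Categorical(["vhigh", "low", "med", "high", "low"],
                   categories=levels, ordered=True)
)

codes = costs.cat.codes                   # low=0, med=1, high=2, vhigh=3
median_code = np.median(codes)            # sorted codes are 0, 0, 1, 2, 3
median_label = levels[int(median_code)]   # translate the code back to a label

print(median_code, median_label)
```

With an even number of observations the median code can fall half-way between two categories, in which case `int()` rounds down to the lower one. Missing values would appear as code `-1`, so this approach assumes the column has none.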


**Calculating a table of luggage capacity proportions**

In [34]:
luggage_prop = car_eval['luggage'].value_counts(normalize=True)
luggage_prop

small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64

**Seeing if there are missing values in the luggage variable column**

In [38]:
luggage_prop_dropna = car_eval['luggage'].value_counts(normalize=True, dropna=False)
print(luggage_prop_dropna)
print("Because this table is the same as the previous, we can conclude that there are no missing values in the luggage column.")

small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64
Because this table is the same as the previous, we can conclude that there are no missing values in the luggage column.
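A more direct check is to count missing entries with `isna()`. The sketch below uses a hypothetical `luggage` Series that deliberately contains one missing value, to show what the two `value_counts` tables look like when they do differ:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one of the five entries is missing
luggage = pd.Series(["small", "med", np.nan, "big", "small"])

n_missing = luggage.isna().sum()  # direct count of missing entries

without_nan = luggage.value_counts(normalize=True)               # NaN excluded
with_nan = luggage.value_counts(normalize=True, dropna=False)    # NaN gets its own row

print(n_missing)
print(without_nan)
print(with_nan)
```

Here the `dropna=False` table has an extra row and smaller proportions for the other categories, which is the difference the comparison above is looking for.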


**Replicating the previous result without passing normalize=True to value_counts**

In [40]:
print("As seen below, dividing the value counts of the luggage column by the length of the column gives the same result.")
luggage_prop_div = car_eval['luggage'].value_counts()/len(car_eval['luggage'])
print(luggage_prop_div)


As seen below, dividing the value counts of the luggage column by the length of the column gives the same result.
small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64
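The two approaches agree only when the column has no missing values, because `len()` counts missing entries while `value_counts()` skips them by default. A brief sketch on hypothetical data (neither Series comes from car_eval.csv) illustrates both cases:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): a complete column and one with a missing entry
complete = pd.Series(["small", "small", "med", "big"])
with_gap = pd.Series(["small", "med", np.nan, "big"])

by_flag = complete.value_counts(normalize=True)
by_division = complete.value_counts() / len(complete)
same = np.allclose(by_flag.sort_index(), by_division.sort_index())

# With a missing value, dividing by len() leaves proportions that sum to less than 1
gap_total = (with_gap.value_counts() / len(with_gap)).sum()
flag_total = with_gap.value_counts(normalize=True).sum()

print(same, gap_total, flag_total)
```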


**Finding the count of cars with 5 or more doors**

In [49]:
frequency = (car_eval['doors'] == '5more').sum()
print(f'The frequency of cars with 5 or more doors in this dataset is {frequency}.')

The frequency of cars with 5 or more doors in this dataset is 246.


**Finding the proportion of cars with five or more doors**

In [50]:
proportion = (car_eval['doors'] == '5more').mean()
print(f'The proportion of cars with 5 or more doors in this dataset is: {proportion}.')

The proportion of cars with 5 or more doors in this dataset is: 0.246.
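Both results come from the same boolean mask: `True` counts as 1 and `False` as 0, so `.sum()` gives the number of matches and `.mean()` gives their share of all rows. A minimal sketch with a hypothetical `doors` Series:

```python
import pandas as pd

# Toy data (hypothetical): 2 of 5 cars have five or more doors
doors = pd.Series(["2", "3", "4", "5more", "5more"])

mask = doors == "5more"   # boolean Series, True where the condition holds
count = mask.sum()        # number of True values
proportion = mask.mean()  # count divided by the number of rows

print(count, proportion)
```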
