In [1]:
import numpy as np
import pandas
import scipy.stats
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi 
%matplotlib inline
import matplotlib.pyplot as plt

# Car Evaluation Database

The Car Evaluation Database is part of the UC Irvine Machine Learning Repository. The data file is loaded from the UCI archive URL used in the loading cell below. The dataset contains 7 attributes:
1. class: how clients accepted the car. This attribute has 4 values: unacc, acc, good, vgood.
2. buying: vhigh, high, med, low. 
3. maint: vhigh, high, med, low.
4. doors: 2, 3, 4, 5more 
5. persons: 2, 4, more. 
6. lug_boot: small, med, big. 
7. safety: low, med, high.

In week 2 we explored the relationship between safety and whether clients accept the car, and showed that safety is related to the level of acceptance. More details can be found at [link](https://github.com/giladsa/DataAnalysisTools/blob/master/week2_cars.ipynb).

In this work we check whether that relationship is moderated by the cost of the car (the `buying` attribute).

In [2]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data'
header = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
data = pandas.read_csv(url, low_memory=False, names=header)

As in week 2, we collapse the class into a two-level label for whether the car is recommended: `unacc` and `acc` become "dontBuy", while `good` and `vgood` become "buy".

In [3]:
data['recommend'] = data['class'].map(lambda x:  'dontBuy' if (x=='unacc' or x=='acc') else 'buy')

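We do the same for the buying cost: `vhigh` or `high` becomes "expensive", anything else becomes "reasonable". Note that the cell below reads the `maint` column (maintenance cost) rather than `buying`, so the outputs that follow are split by maintenance cost.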

In [4]:
data['expensive'] = data['maint'].map(lambda x:  'expensive' if (x=='vhigh' or x=='high') else 'reasonable')
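The two-level recode described above can also be written without a lambda. This is a minimal sketch on a toy frame (the `toy` DataFrame and its values are illustrative assumptions, not rows from the dataset), applied to the `buying` column as the text describes:

```python
import numpy as np
import pandas as pd

# Toy frame; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({"buying": ["vhigh", "high", "med", "low"]})

# Vectorized version of the lambda-based recode:
# vhigh/high -> "expensive", everything else -> "reasonable".
toy["expensive"] = np.where(
    toy["buying"].isin(["vhigh", "high"]), "expensive", "reasonable"
)
print(toy)
```

`isin` keeps the list of "expensive" levels in one place, which makes it harder to point the recode at the wrong levels by accident.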

We create two sub-datasets based on whether the car is expensive.

In [5]:
expensive = data[(data['expensive']=='expensive')]
reasonable = data[(data['expensive']=='reasonable')]

## Check for expensive cars

In [6]:
ct=pandas.crosstab(expensive['recommend'],expensive['safety'])
print (ct)

safety     high  low  med
recommend                
buy          13    0    0
dontBuy     275  288  288


In [7]:
cs= scipy.stats.chi2_contingency(ct)
print (cs)

(26.397179788484141, 1.8532125817174455e-06, 2, array([[   4.33333333,    4.33333333,    4.33333333],
       [ 283.66666667,  283.66666667,  283.66666667]]))


We see that for expensive cars there is a relationship between safety and whether clients are willing to buy the car (chi-square ≈ 26.4, p ≈ 1.9e-06). Next we run pairwise comparisons between the safety levels.
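The printed result is a plain tuple, which is hard to read at a glance. As a small sketch, the same test can be unpacked into named values; the counts are copied from the expensive-car crosstab above, and the variable names are my own:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the expensive-car crosstab
# (rows: buy, dontBuy; columns: safety high, low, med).
observed = np.array([[13, 0, 0],
                     [275, 288, 288]])

# The result unpacks into statistic, p-value, degrees of freedom
# and the table of expected counts under independence.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.3f}, p = {p_value:.2e}, dof = {dof}")
print("smallest expected count:", expected.min())
```

Looking at the expected counts is worthwhile here: the "buy" row has expected counts below 5, a range where the chi-square approximation is usually considered less reliable.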

In [8]:
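def chi2ByCat(data, recodeObj):
    # Recode safety into a separate Series instead of adding a column to a slice;
    # this avoids the SettingWithCopyWarning. Unmapped levels become NaN and are
    # dropped by crosstab.
    comp = data['safety'].map(recodeObj).rename('COMP_rec')
    ct = pandas.crosstab(data['recommend'], comp)
    print(ct)
    # Column percentages: share of buy/dontBuy within each safety level
    colsum = ct.sum(axis=0)
    colpct = ct / colsum
    print(colpct)
    # Chi-square test of independence on the two-level table
    cs = scipy.stats.chi2_contingency(ct)
    print(cs)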

In [9]:
recode1 = {"high": "high", "low": "low"}
chi2ByCat(expensive,recode1)

COMP_rec   high  low
recommend           
buy          13    0
dontBuy     275  288
COMP_rec       high  low
recommend               
buy        0.045139    0
dontBuy    0.954861    1
(11.332695723459487, 0.00076154271093483328, 1, array([[   6.5,    6.5],
       [ 281.5,  281.5]]))


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
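The warning above appears when a new column is assigned to a DataFrame that was produced by filtering another one. One common way to avoid it, sketched here on a toy frame (the `cars` frame and its values are illustrative assumptions), is to take an explicit copy of the subset before modifying it:

```python
import pandas as pd

# Toy frame standing in for the car data (values are illustrative only).
cars = pd.DataFrame({
    "safety": ["high", "low", "med", "high"],
    "expensive": ["expensive", "reasonable", "expensive", "expensive"],
})

# .copy() makes the subset an independent DataFrame, so adding a column
# to it is unambiguous and leaves the original frame untouched.
subset = cars[cars["expensive"] == "expensive"].copy()
subset["COMP_rec"] = subset["safety"].map({"high": "high", "med": "med"})

print(subset)
print("COMP_rec" in cars.columns)
```

The same idea would apply to the `expensive` and `reasonable` subsets if they were created with `.copy()`.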


In [10]:
recode1 = {"high": "high", "med": "med"}
chi2ByCat(expensive,recode1)

COMP_rec   high  med
recommend           
buy          13    0
dontBuy     275  288
COMP_rec       high  med
recommend               
buy        0.045139    0
dontBuy    0.954861    1
(11.332695723459487, 0.00076154271093483328, 1, array([[   6.5,    6.5],
       [ 281.5,  281.5]]))




In [11]:
recode1 = {"low": "low", "med": "med"}
chi2ByCat(expensive,recode1)

COMP_rec   low  med
recommend          
dontBuy    288  288
COMP_rec   low  med
recommend          
dontBuy      1    1
(0.0, 1.0, 0, array([[ 288.,  288.]]))




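In conclusion, for expensive cars the high-vs-low and high-vs-med comparisons show the same pattern: only high-safety cars are ever recommended (about 4.5% of them), and both tests are significant (p ≈ 0.00076). For low vs. med, no car is recommended in either group, so the table has a single row and the test is uninformative. On this evidence the cost of the car is not a moderator.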
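Because three pairwise tests are run on the same data, it is common to compare each p-value against an adjusted threshold. The notebook does not state which correction it uses; the sketch below assumes a Bonferroni adjustment and uses the p-values printed above:

```python
# p-values copied from the three pairwise chi-square tests above.
p_values = {
    "high vs low": 0.00076154271093483328,
    "high vs med": 0.00076154271093483328,
    "low vs med": 1.0,
}

alpha = 0.05
# Bonferroni (an assumption here): divide alpha by the number of comparisons.
adjusted_alpha = alpha / len(p_values)

significant = {pair: p < adjusted_alpha for pair, p in p_values.items()}
for pair, flag in significant.items():
    print(f"{pair}: p = {p_values[pair]:.2e}, significant = {flag}")
```

Under this assumption the two comparisons involving high safety remain significant, and low vs. med does not.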

## Reasonable cost

In [12]:
ct=pandas.crosstab(reasonable['recommend'],reasonable['safety'])
print (ct)
cs= scipy.stats.chi2_contingency(ct)
print (cs)


safety     high  low  med
recommend                
buy          82    0   39
dontBuy     206  288  249
(97.006951937087734, 8.6140628022603454e-22, 2, array([[  40.33333333,   40.33333333,   40.33333333],
       [ 247.66666667,  247.66666667,  247.66666667]]))


For reasonably priced cars, safety is also significantly related to the recommendation (chi-square ≈ 97.0, p ≈ 8.6e-22), so a lower cost is not a moderating factor either. The pairwise comparisons follow.

In [13]:
recode1 = {"high": "high", "low": "low"}
chi2ByCat(reasonable,recode1)

COMP_rec   high  low
recommend           
buy          82    0
dontBuy     206  288
COMP_rec       high  low
recommend               
buy        0.284722    0
dontBuy    0.715278    1
(93.293571640169844, 4.5085406739280325e-22, 1, array([[  41.,   41.],
       [ 247.,  247.]]))




In [14]:
recode1 = {"high": "high", "med": "med"}
chi2ByCat(reasonable,recode1)

COMP_rec   high  med
recommend           
buy          82   39
dontBuy     206  249
COMP_rec       high       med
recommend                    
buy        0.284722  0.135417
dontBuy    0.715278  0.864583
(18.455435473617293, 1.7392405653996703e-05, 1, array([[  60.5,   60.5],
       [ 227.5,  227.5]]))




In [15]:
recode1 = {"low": "low", "med": "med"}
chi2ByCat(reasonable,recode1)

COMP_rec   low  med
recommend          
buy          0   39
dontBuy    288  249
COMP_rec   low       med
recommend               
buy          0  0.135417
dontBuy      1  0.864583
(39.714654061022777, 2.9391191640380984e-10, 1, array([[  19.5,   19.5],
       [ 268.5,  268.5]]))




Conclusion: safety affects clients' willingness to buy a car, for both expensive and reasonably priced cars. The cost is not a moderating factor.
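To read the two subgroups side by side, the share of recommended cars can be tabulated per cost group and safety level. This is a sketch built only from the counts printed in the crosstabs above; the frame layout and names are my own:

```python
import pandas as pd

# Counts copied from the two crosstabs in this notebook.
counts = pd.DataFrame(
    {
        "buy": [13, 0, 0, 82, 0, 39],
        "dontBuy": [275, 288, 288, 206, 288, 249],
    },
    index=pd.MultiIndex.from_product(
        [["expensive", "reasonable"], ["high", "low", "med"]],
        names=["cost", "safety"],
    ),
)

# Share of cars labelled "buy" within each cost group and safety level.
buy_rate = counts["buy"] / counts.sum(axis=1)
print(buy_rate.unstack("safety").round(3))
```

Each row of the printed table is one cost group, which makes it easier to compare how the buy rate changes with safety in each group.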