Step 1: Importing the libraries

In [1]:
import numpy as np
import pandas as pd

Step 2: Importing dataset

In [None]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[ : , :-1].values
Y = dataset.iloc[ : , 3].values

#  -----------------------------------------------------------

xl = pd.ExcelFile("dognition_data_aggregated_by_dogid.xlsx")
# pd.read_excel()
# xlrd is used by pandas
xl.sheet_names
print('Sheet name of the excel file:', xl.sheet_names[0])
df = xl.parse("dog_id_max_ranks")
# produce a dataframe
# df.head()


# For each column, show the number of nan values

# Search for the nan values in each column
print(df.isnull().sum())
# df.isnull() is also a dataframe, and then sum() by columns


# Search for the nan values in each row

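# Count the NaNs in each row, then tabulate how many rows have each count
num = df.isnull().sum(axis = 1).tolist()
l1 = sorted(set(num))
fre = {i: num.count(i) for i in l1}
# list(...) is needed because dict.items() is a view in Python 3
dt = pd.DataFrame(list(fre.items()), columns=['Number of NaNs', 'Count'])
print(dt)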
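The per-row count above can also be written with pandas alone. The following is a minimal sketch on a made-up frame (`toy` and its values are assumptions standing in for `df`), not the notebook's own data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `df`; the values are made up for illustration.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 8.0],
    "c": [1.0, np.nan, 1.0, 1.0],
})

nan_per_row = toy.isnull().sum(axis=1)      # number of NaNs in each row
nan_table = (
    nan_per_row.value_counts()              # how many rows share each NaN count
    .sort_index()
    .rename_axis("Number of NaNs")
    .reset_index(name="Count")
)
print(nan_table)
```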


Step 3: Handling the missing data and outliers

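If a value is marked as None, you can leave it as is, drop the whole sample, or impute a reasonable value. For time series, however, you must not use future values to infer past ones.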
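For a time series, one way to respect that constraint is to fill gaps from earlier observations only. A minimal sketch, assuming a made-up daily series (`prices` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with gaps.
prices = pd.Series(
    [10.0, np.nan, np.nan, 13.0, np.nan],
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)

causal = prices.ffill()   # forward fill: each gap takes the last known past value
leaky = prices.bfill()    # backward fill: each gap takes a future value (look-ahead)

print(pd.DataFrame({"raw": prices, "ffill": causal, "bfill": leaky}))
```

Forward fill only ever copies information forward in time, whereas backward fill puts the value from 2020-01-04 into the two days before it, which is exactly the kind of inference from the future that should be avoided.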

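Harder to handle is missing data that is recorded as a real value but is actually wrong, in other words an outlier. Such values have to be detected first and, once detected, marked as missing data.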

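An outlier is usually thought of as fake or erroneous data, but in essence it is data that the current model cannot explain or predict. Take stock price prediction: the prices are of course real, but if our model cannot explain and predict them, they should still be treated as outliers. Leave alone what you do not understand and keep it out of the data the model is meant to explain; otherwise it will disturb the model. For example, a statistical model requires highly repeatable data, that is, large samples, so extreme data points are outliers from its point of view (they fail the statistical tests and are of no use to the model). If, however, we have a strong mechanistic model that can explain and predict these points, they should be included in the model and no longer treated as outliers.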

A QQ plot is one way to look for outliers. For example, if the distribution is normal in theory but the QQ plot shows fat tails, consider whether the points in the tails are outliers.

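We can also use a box plot to detect outliers in continuous variables. Q1 is the first quartile, Q3 is the third quartile, and IQR = Q3 - Q1 is the interquartile range. Any value outside the range from Q1 - 1.5 x IQR to Q3 + 1.5 x IQR is regarded as an outlier. The premise is that our model requires normally distributed data and that we assume the data points follow a normal distribution. Under that assumption, the probability of falling outside this range is only 1 - 50% - 2 x 24.65% = 0.7%. If the dataset is small, such points should not appear, so they are treated as erroneous data, i.e. outliers.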
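The box-plot rule can be sketched in a few lines. The helper name `iqr_outlier_mask` and the sample weights are assumptions made up for illustration; the second half checks the tail probability quoted above for an exactly normal variable.

```python
from statistics import NormalDist

import pandas as pd

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean Series that is True where a value lies outside the IQR fences."""
    s = pd.Series(values, dtype=float)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Made-up weights; only the last one is far from the rest.
weights = pd.Series([12.0, 14.0, 15.0, 15.5, 16.0, 17.0, 18.0, 95.0])
mask = iqr_outlier_mask(weights)
cleaned = weights.mask(mask)   # flagged values become NaN, i.e. missing data
print(cleaned)

# Probability mass outside the fences for an exactly normal variable.
z_q3 = NormalDist().inv_cdf(0.75)        # Q3 of the standard normal, about 0.674
upper_fence = z_q3 + 1.5 * (2 * z_q3)    # Q3 + 1.5 * IQR, about 2.70 standard deviations
tail = 2 * (1 - NormalDist().cdf(upper_fence))
print(f"{tail:.2%} of a normal distribution lies outside the fences")
```

Replacing flagged values with NaN follows the note above that detected outliers are marked as missing data, so the usual imputation step can then handle them.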

Another criterion: data points three or more standard deviations away from the mean are considered outliers. Outlier detection is merely a special case of examining the data for influential data points, and it also depends on business understanding.
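The three-standard-deviation criterion amounts to thresholding z-scores. A minimal sketch, where `zscore_outlier_mask` and the sample values are hypothetical:

```python
import numpy as np

def zscore_outlier_mask(values, threshold=3.0):
    """True where a value is `threshold` or more sample standard deviations from the mean."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    return np.abs(z) >= threshold

# 50 evenly spaced, well-behaved points plus one planted extreme value.
values = np.append(np.linspace(-1.0, 1.0, 50), 10.0)
mask = zscore_outlier_mask(values)
print(values[mask])
```

One caveat worth keeping in mind: the extreme point inflates the mean and standard deviation it is measured against, and in a sample of n points no z-score computed this way can exceed (n - 1) / sqrt(n). With 10 or fewer points the rule can therefore never flag anything.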

PCA can also be used in outlier detection.
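One common way to use PCA for this, sketched here under my own assumptions rather than as the notebook's method, is to project the data onto the leading principal components and flag the points with the largest reconstruction error. The function name and the toy data are hypothetical, and plain NumPy SVD stands in for a PCA library.

```python
import numpy as np

def pca_reconstruction_error(X, n_components=1):
    """Distance between each centered row and its projection onto the top principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    projected = centered @ components.T @ components
    return np.linalg.norm(centered - projected, axis=1)

# Points lying close to the line y = 2x, plus one point far off that line.
x = np.linspace(0.0, 10.0, 40)
line_points = np.column_stack([x, 2.0 * x + 0.1 * np.sin(3.0 * x)])
data = np.vstack([line_points, [5.0, 20.0]])

errors = pca_reconstruction_error(data, n_components=1)
suspect = int(np.argmax(errors))
print("row with the largest reconstruction error:", suspect)
```

The off-line point has ordinary values in each column taken separately, so a per-column box plot would not flag it; it stands out only relative to the joint structure that the first component captures.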


In [None]:
# Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
imputer = imputer.fit(X[ : , 1:3])
X[ : , 1:3] = imputer.transform(X[ : , 1:3])

# ----------------------------------------------------------

# Simple handling of the missing data: drop the first row that has 22 NaNs
# (num is the per-row NaN count computed in Step 2)
db = df.drop(df.index[[num.index(22)]])


Step 4: Dropping unnecessary columns and converting the type of some columns

In [None]:
# step 1: drop unnecessary columns
fields_to_drop = ['instant', 'dteday', 'atemp', 'workingday',
                  'casual', 'registered']
cnt_df = cnt_df.drop(fields_to_drop, axis=1)
# step 2: type converting
fields_to_convert = ['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
                     'weathersit']
for field in fields_to_convert:
    # cnt_df[field] = cnt_df[field].astype('object')
    cnt_df[field] = cnt_df[field].astype('str')

Step 5: Encoding categorical data if necessary

In [None]:
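from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Creating dummy variables
# a dummy variable is a 0/1 indicator for one level of a categorical variable

# OneHotEncoder(categorical_features=...) was removed in scikit-learn 0.22;
# ColumnTransformer picks the column to encode and passes the rest through.
# OneHotEncoder accepts string categories directly, so column 0 needs no LabelEncoder.
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])],
                       remainder = 'passthrough', sparse_threshold = 0)
# assumes the remaining columns are numeric (they were imputed in Step 3)
X = np.array(ct.fit_transform(X), dtype = float)
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)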

# ----------------------------------------------------------------

db2 = db.copy()
for y in db2.columns:
    if db2[y].dtype == object:
        db2[y] = db2[y].astype('category')
db2['Dog_Fixed'] = db2['Dog_Fixed'].astype('category') 
db2['DNA_Tested'] = db2['DNA_Tested'].astype('category')
db2['Subscribed'] = db2['Subscribed'].astype('category')
# categorical data in pandas is like factor data in R

# change unreasonable values in the data

db2.loc[db2['Weight'] == 0.0,'Weight'] = 0.1
# first check the distribution of Weight, and other int and float columns

db2.loc[db2['Max_Dogs'] == 0.0, 'Max_Dogs'] = 1.0
# first check the distribution of Max_Dogs

# convert the column below to ordered category

db2['Last_Active_At'] = db2['Last_Active_At'].cat.as_ordered()
print(db2.dtypes)
print()

print('The columns of ordered category:')
for i in db2.columns:
    if db2[i].dtype.name == 'category' and db2[i].cat.ordered:
        print(i)


Step 6: Log transformation and feature scaling for continuous variables

For a continuous variable that is not normally distributed and is skewed, or that has many outliers, try a log (natural log or log10) or square-root transformation: log10(x), sqrt(x), or log10(x+1), sqrt(x+1). After the transformation, points that previously looked like outliers may turn out not to be outliers at all (draw the box plot again with the transformed data to check). In that case, use log10(x) instead of x in the subsequent modeling.

A log transformation helps when most of the data is concentrated in a very small region: taking the log stretches that region out so that it can be seen more clearly.
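The effect of the transformation can be checked numerically by comparing skewness before and after. A minimal sketch on made-up counts (`counts` is a hypothetical stand-in for a column such as 'Total Tests Completed'); it uses log10(x+1) because the data contains a zero, for which log10(x) is undefined:

```python
import numpy as np
import pandas as pd

# Made-up right-skewed counts: most values are small, a few are very large.
counts = pd.Series([0, 1, 1, 2, 2, 3, 3, 4, 5, 8, 13, 40, 120], dtype=float)

log_counts = np.log10(counts + 1)   # the +1 keeps zero counts finite

skew_before = counts.skew()
skew_after = log_counts.skew()
print("skewness before:", round(skew_before, 2))
print("skewness after: ", round(skew_after, 2))
```

If the skewness drops substantially, redrawing the box plot on the transformed column is a reasonable next check on whether the former outliers still fall outside the whiskers.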


In [None]:
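from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
# X_train and X_test come from the split in Step 7, which has to run first
X_train = sc_X.fit_transform(X_train)
# reuse the training-set mean and std; do not refit on the test set
X_test = sc_X.transform(X_test)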

#-------------------------------------------------------

from ggplot import *

p = ggplot(db2, aes(x='Total Tests Completed'))
# p + geom_histogram()
# p + geom_histogram(binwidth=1)
p + geom_histogram(bins=20)

p = ggplot(db2, aes(x='Gender',y='Total Tests Completed'))
# p + geom_violin(alpha = .75)
p + geom_violin()

import seaborn as sns
# seaborn.__version__
sns.violinplot(x=db2['Gender'], y=db2['Total Tests Completed'], inner=None, color="white", 
               cut=0)
sns.stripplot(x=db2['Gender'], y=db2['Total Tests Completed'], jitter=.3,  color="black", 
              alpha=.1, size=4)

sns.set_style("whitegrid")
g=sns.boxplot(x=db2['Total Tests Completed'])

temp = np.log10(db2['Total Tests Completed'])
sns.boxplot(x=temp)

sns.boxplot(x="Gender", y="Total Tests Completed", data=db2)


Step 7: Splitting the dataset into training and test sets

In [None]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
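Because the feature scaling in Step 6 operates on X_train and X_test, the split has to happen before the scaling, and the scaling statistics should come from the training set only. The sketch below illustrates that order with plain NumPy as a stand-in for train_test_split and StandardScaler; the random data and variable names are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))   # made-up feature matrix

# 1) Split first: shuffle the row indices and hold out 20% as the test set.
idx = rng.permutation(len(X))
n_test = int(0.2 * len(X))
test_idx, train_idx = idx[:n_test], idx[n_test:]
X_train, X_test = X[train_idx], X[test_idx]

# 2) Then scale: learn mean and std on the training set only ...
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# ... and apply the same statistics to both sets.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0).round(6), X_test_scaled.mean(axis=0).round(3))
```

The scaled test set will generally not have exactly zero mean and unit variance. That is expected: it is being measured on the scale learned from the training data, which is the same scale new, unseen data would be measured on.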