
This notebook follows the feature engineering in part 02:  
Outliers : I'll cap the outliers in X_train, then cap the outliers in X_test using the boundaries computed from X_train.  
Missing values : I'll impute the missing values in categorical features with the most frequent value,  
and the missing values in numerical features with the median.

In [0]:
# libraries
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# **Data Set**

In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/WoradeeKongthong/raining_tomorrow_classification/master/weatherAUS.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142193 entries, 0 to 142192
Data columns (total 24 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           142193 non-null  object 
 1   Location       142193 non-null  object 
 2   MinTemp        141556 non-null  float64
 3   MaxTemp        141871 non-null  float64
 4   Rainfall       140787 non-null  float64
 5   Evaporation    81350 non-null   float64
 6   Sunshine       74377 non-null   float64
 7   WindGustDir    132863 non-null  object 
 8   WindGustSpeed  132923 non-null  float64
 9   WindDir9am     132180 non-null  object 
 10  WindDir3pm     138415 non-null  object 
 11  WindSpeed9am   140845 non-null  float64
 12  WindSpeed3pm   139563 non-null  float64
 13  Humidity9am    140419 non-null  float64
 14  Humidity3pm    138583 non-null  float64
 15  Pressure9am    128179 non-null  float64
 16  Pressure3pm    128212 non-null  float64
 17  Cloud9am       88536 non-null

In [0]:
# drop RISK_MM column (the Kaggle data description recommends excluding it: it leaks the target)
df.drop(['RISK_MM'], axis = 1, inplace = True)

# Extract Year, Month, Day from Date column
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# drop Date column
df.drop(['Date'], axis = 1, inplace = True)

# select year 2015-2017 to train the model
df = df[df['Year'] > 2014]

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43205 entries, 2109 to 142192
Data columns (total 25 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Location       43205 non-null  object 
 1   MinTemp        43002 non-null  float64
 2   MaxTemp        43081 non-null  float64
 3   Rainfall       42799 non-null  float64
 4   Evaporation    19472 non-null  float64
 5   Sunshine       16085 non-null  float64
 6   WindGustDir    40818 non-null  object 
 7   WindGustSpeed  40837 non-null  float64
 8   WindDir9am     40515 non-null  object 
 9   WindDir3pm     41457 non-null  object 
 10  WindSpeed9am   42990 non-null  float64
 11  WindSpeed3pm   41715 non-null  float64
 12  Humidity9am    42696 non-null  float64
 13  Humidity3pm    40781 non-null  float64
 14  Pressure9am    38536 non-null  float64
 15  Pressure3pm    38533 non-null  float64
 16  Cloud9am       25312 non-null  float64
 17  Cloud3pm       22877 non-null  float64
 18  Te

In [0]:
X = df.drop(['RainTomorrow'], axis=1)
y = df['RainTomorrow']

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

# **Handle the Outliers**
Training Set
- cap the outliers in X_train

Test Set
- cap the outliers in X_test using the upper_cap and lower_cap of X_train
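
The idea of learning the caps on the training set and reusing them on the test set can be sketched on toy data. This is a minimal sketch, not the notebook's actual cell: the helper names `fit_iqr_caps` and `apply_caps` and the toy values are made up for illustration, and it assumes the usual 1.5 × IQR rule.

```python
import numpy as np
import pandas as pd

def fit_iqr_caps(frame, k=1.5):
    """Return (lower, upper) caps per numeric column, computed from `frame` only."""
    q1 = frame.quantile(0.25, numeric_only=True)
    q3 = frame.quantile(0.75, numeric_only=True)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def apply_caps(frame, lower, upper):
    """Clip each capped column into [lower, upper]; NaN values stay NaN."""
    capped = frame.copy()
    for col in lower.index:
        capped[col] = capped[col].clip(lower=lower[col], upper=upper[col])
    return capped

# Toy data (hypothetical values, not taken from weatherAUS.csv)
toy_train = pd.DataFrame({'WindGustSpeed': [30.0, 35.0, 40.0, 45.0, 120.0]})
toy_test = pd.DataFrame({'WindGustSpeed': [5.0, 38.0, np.nan, 200.0]})

# The caps come from the training data only ...
lower, upper = fit_iqr_caps(toy_train)

# ... and are then applied unchanged to both sets
toy_train_capped = apply_caps(toy_train, lower, upper)
toy_test_capped = apply_caps(toy_test, lower, upper)

print('caps :', lower['WindGustSpeed'], upper['WindGustSpeed'])
print(toy_test_capped)
```

Here Q1 = 35 and Q3 = 45, so the caps are 20 and 60. The test values 5 and 200 are pulled to those caps even though the test set never contributed to them, and the missing value is left for the imputer.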

## **Cap the outliers in X_train**

In [8]:
# work on an explicit copy so the assignments below modify X_train itself
# (this avoids pandas' SettingWithCopyWarning)
X_train = X_train.copy()

# IQR boundaries, computed from the numerical columns of the training set only
Q1 = X_train.quantile(0.25, numeric_only=True)
Q3 = X_train.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1

lower_cap = Q1 - 1.5*IQR
upper_cap = Q3 + 1.5*IQR

features = lower_cap.index.values

# clip every numerical feature into [lower_cap, upper_cap]; NaN values stay NaN
for feature in features :
  X_train[feature] = X_train[feature].clip(lower=lower_cap[feature], upper=upper_cap[feature])


## **Cap the outliers in X_test**

In [9]:
# work on an explicit copy (avoids pandas' SettingWithCopyWarning)
X_test = X_test.copy()

# clip with the boundaries learned from X_train, never from X_test
for feature in features :
  X_test[feature] = X_test[feature].clip(lower=lower_cap[feature], upper=upper_cap[feature])


# **Support Vector Machine**

## **Create Preprocessor : ColumnTransformer of numerical and categorical features**

In [0]:
numerical_features = [x for x in X.columns if df[x].dtype != 'object']

numeric_transformer = Pipeline(steps=[
          ('imputer', SimpleImputer(strategy='median')),
          ('scaler', MinMaxScaler())
])

categorical_features = [x for x in X.columns if df[x].dtype == 'object']

categorical_transformer = Pipeline(steps=[
          ('imputer', SimpleImputer(strategy='most_frequent')),
          ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
          ('num', numeric_transformer, numerical_features),
          ('cat', categorical_transformer, categorical_features)
    ]
)
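
To see what a preprocessor of this shape produces, here is a small sketch on toy data. The column values are made up, `toy_preprocessor` is a hypothetical name, and `sparse_threshold=0` is added only so the result prints as a dense array; the rest mirrors the transformers above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frames (hypothetical values): one numerical and one categorical feature
toy_train = pd.DataFrame({
    'Humidity3pm': [20.0, np.nan, 60.0, 100.0],
    'WindGustDir': ['N', 'N', np.nan, 'S'],
})
toy_test = pd.DataFrame({
    'Humidity3pm': [np.nan, 140.0],
    'WindGustDir': ['W', 'S'],   # 'W' never appears in toy_train
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', MinMaxScaler()),
        ]), ['Humidity3pm']),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('onehot', OneHotEncoder(handle_unknown='ignore')),
        ]), ['WindGustDir']),
    ],
    sparse_threshold=0,  # assumption for this demo: always return a dense array
)

# Medians, modes, min/max and categories are all learned from the training frame
train_out = toy_preprocessor.fit_transform(toy_train)
# The test frame is only transformed with what was learned above
test_out = toy_preprocessor.transform(toy_test)

print(train_out)
print(test_out)
```

In the test output, the missing humidity is filled with the training median (60, scaled to 0.5), the unseen direction 'W' becomes an all-zero one-hot row, and 140 scales to 1.5 because `MinMaxScaler` does not clip by default. That last point is one reason to cap X_test with the X_train boundaries before scaling.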

## **Create model**

In [0]:
from sklearn.svm import SVC
model = SVC(kernel = 'linear')

## **Create Pipeline**

In [0]:
clf = Pipeline(steps=[
      ('preprocessor', preprocessor),
      ('model', model)
])

# **Model Evaluation**

## **Cross Validation**

In [13]:
accuracy = cross_val_score(clf,X_train,y_train,cv=10)
print('accuracy : ', accuracy)
print('mean : ', accuracy.mean())
print('std : ', accuracy.std())

accuracy :  [0.84408447 0.8501591  0.84842349 0.84582008 0.8443287  0.85069444
 0.84143519 0.84375    0.84056713 0.84809028]
mean :  0.8457352868307997
std :  0.00332849389821565


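Note : accuracy is consistent across the 10 folds (mean ≈ 0.846, std ≈ 0.003), which suggests low variance. Accuracy alone does not show whether bias is low, so it should be read together with the per-class metrics.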

## **Training and Test evaluation**

In [14]:
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print('\n\nTraining and Test Sets result')
print('\naccuracy score : ', accuracy_score(y_test,y_pred))
print('\nconfusion matrix : \n', confusion_matrix(y_test, y_pred))
print('\nclassification report : \n', classification_report(y_test,y_pred))

print('Training set score : ',clf.score(X_train,y_train))
print('Test set score : ',clf.score(X_test,y_test))



Training and Test Sets result

accuracy score :  0.8464298113644254

confusion matrix : 
 [[6476  276]
 [1051  838]]

classification report : 
               precision    recall  f1-score   support

          No       0.86      0.96      0.91      6752
         Yes       0.75      0.44      0.56      1889

    accuracy                           0.85      8641
   macro avg       0.81      0.70      0.73      8641
weighted avg       0.84      0.85      0.83      8641

Training set score :  0.8470084480962852
Test set score :  0.8464298113644254


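Note : training accuracy (0.847) and test accuracy (0.846) are nearly identical, so there is no sign of overfitting. However, recall for the 'Yes' class is only 0.44, so the model misses more than half of the rainy days.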
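
The per-class numbers can be recomputed by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's layout (rows are the actual classes, columns the predicted ones, in the order No, Yes) and adds one figure the notebook does not print: the accuracy of always predicting 'No'.

```python
import numpy as np

# Confusion matrix reported above: rows = actual (No, Yes), columns = predicted (No, Yes)
cm = np.array([[6476, 276],
               [1051, 838]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()            # share of all days predicted correctly
recall_yes = tp / (fn + tp)                # share of rainy days that were caught
precision_yes = tp / (fp + tp)             # share of 'Yes' predictions that were right
baseline_accuracy = (tn + fp) / cm.sum()   # accuracy of always predicting 'No'

print('accuracy          :', round(accuracy, 3))
print('recall (Yes)      :', round(recall_yes, 3))
print('precision (Yes)   :', round(precision_yes, 3))
print('baseline accuracy :', round(baseline_accuracy, 3))
```

Always predicting 'No' would already score about 0.781 on this test set, so the linear SVM's 0.846 is a smaller gain than the raw accuracy suggests.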

# **Improve the model**
The next part tries a kernel SVM to improve on this linear SVM.