# Classification

I'll apply several classification models to a dataset of Kickstarter projects, available on Kaggle at this [link](https://www.kaggle.com/kemical/kickstarter-projects). 
I couldn't upload the data to GitHub because the file is too large.  
The code uses the [scikit-learn package](https://scikit-learn.org/stable/): I import each class or function in the cell where I use it, to make clear where it comes from.  
I won't comment on model performance: I'll only explain how the AUC and the ROC curves are interpreted.  
This is a short guide that can be useful for quickly performing a classification task.

## Data Preparation

To prepare the data I use a class I wrote myself, which I uploaded to [GitHub](https://github.com/fog97/Class).
I won't apply model-specific preprocessing: I prepare the data once and use it for both training and testing.

First, I drop several levels of the target variable to obtain a binary classification task.  
I also limit the number of rows, because the full dataset is too large for my machine: I'll work on 25000 rows.  

In [None]:
d=data_preparation()
data=d.importer('./kickstarter.csv','r',',')
# drop the target classes tied to temporary states
# two classes remain: 'failed' and 'successful'
stop=[]
for el in data.state:
  if el!='failed' and el!='successful' and el not in stop:
    stop.append(el)
for parola in stop:    
  data=data[data.state!=parola]
data1=data.iloc[:25000,]

Next I impute missing values (if any are present), encode the categorical variables, and recode the target variable from "successful"/"failed" to 0/1 respectively.

In [None]:
import numpy as np
colname=list(data1.columns)
Y=data1.state
# split numeric and categorical columns, leaving the target out
num=data1.select_dtypes(include=('float','int'))
char=data1.drop(list(num),axis=1).drop('state',axis=1)
cat_clean=char.drop(['ID','name','deadline','launched','category'],axis=1)
lab=["Var","Var2","Chi_quadro","P_Value"]
# impute missing values, then encode the categorical variables
numm=d.num_imputation(num)
cat=d.cat_imputation(cat_clean)
catt=d.cat_encoding(cat)

# keep only a subset of the numeric columns
num_clean=np.delete(numm,[1,3,4,7,8],axis=1)
# class frequencies, used later as priors for Naive Bayes
f=int((Y=='failed').sum())
prior_f=round(f/len(Y),3)
# complement, so that the two priors still sum to 1 after rounding
prior_s=round(1-prior_f,3)
#1=Failed, 0=Successful
Yy=np.where(Y=='failed',1,0)

The last operation on the data is the creation of the train and test samples,  
using a 60%-40% split (60% for training, 40% for testing).

In [None]:
from sklearn.model_selection import train_test_split
df=np.append(num_clean,catt,axis=1)
X=df
# test_size is the share of rows held out for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Yy, test_size=0.4, random_state=0)

# Models

Now I'm ready to start with the first model.

## Naive Bayes

The Naive Bayes algorithm approximates the Bayes classifier, which is optimal in theory but cannot be built in practice.  
To do this, the algorithm applies Bayes' theorem while assuming that all explanatory variables are independent given the class.

In [None]:
from sklearn.naive_bayes import GaussianNB
# priors follow the sorted class labels: 0=Successful, 1=Failed
gnb = GaussianNB(priors=(prior_s,prior_f),  var_smoothing=1e-30)
NB_pred = gnb.fit(X_train, Y_train).predict(X_test)
print("Naive Bayes: Number of mislabeled points out of a total %d points : %d"
  % (X_test.shape[0], (Y_test != NB_pred).sum()))
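
To make the independence assumption concrete, here is a minimal sketch that computes a Gaussian Naive Bayes posterior by hand on a tiny made-up dataset. The toy data, the function names (`fit_gaussian_nb`, `posterior`) and the small variance floor are assumptions made for the example; they are not part of the Kickstarter pipeline and do not reproduce scikit-learn's internals.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a prior and per-feature mean/variance for each class."""
    params = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        n_features = len(rows[0])
        means = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
        # var_floor is a simplification that avoids dividing by zero
        variances = [
            sum((r[j] - means[j]) ** 2 for r in rows) / len(rows) + var_floor
            for j in range(n_features)
        ]
        params[label] = (len(rows) / len(X), means, variances)
    return params

def posterior(params, x):
    """Bayes' theorem with the naive assumption: the joint likelihood
    is the product of the per-feature likelihoods."""
    scores = {}
    for label, (prior, means, variances) in params.items():
        likelihood = 1.0
        for xj, m, v in zip(x, means, variances):
            likelihood *= gaussian_pdf(xj, m, v)
        scores[label] = prior * likelihood
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

# Toy data (hypothetical): two features, 1 = failed, 0 = successful
X_toy = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]]
y_toy = [0, 0, 0, 1, 1, 1]

nb_params = fit_gaussian_nb(X_toy, y_toy)
probs = posterior(nb_params, [1.1, 1.9])
print(probs)
```

The new point lies close to the class 0 observations, so almost all of the posterior mass goes to class 0. With many features it is safer to sum log-likelihoods instead of multiplying densities, because the product can underflow.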

## Decision Tree

The Decision Tree is a powerful algorithm: it recursively splits the full dataset into nodes that are as pure as possible. It doesn't need preprocessing, but I'll use the preprocessed data to keep the code shorter.  
The main weakness of this kind of algorithm is the high instability of its results: small changes in the training data can produce a very different tree.

In [None]:
from sklearn import tree
DecTree = tree.DecisionTreeClassifier()
Tree_pred = DecTree.fit(X_train, Y_train).predict(X_test)
print("Tree:Number of mislabeled points out of a total %d points : %d"
  % (X_test.shape[0], (Y_test != Tree_pred).sum()))
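
As a rough illustration of what "pure nodes" means, the sketch below computes the Gini impurity (the default criterion of `DecisionTreeClassifier`) and searches for the best threshold on a single feature. The helper names `gini` and `best_split` and the toy values are assumptions made for the example; a real tree repeats this search over every feature and every node.

```python
def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for a 50/50 binary node."""
    n = len(labels)
    if n == 0:
        return 0.0
    impurity = 1.0
    for cls in set(labels):
        p = labels.count(cls) / n
        impurity -= p * p
    return impurity

def best_split(values, labels):
    """Try a threshold between each pair of consecutive sorted values
    and keep the one with the lowest weighted impurity."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Toy feature (hypothetical): funding goal in thousands, 1 = failed, 0 = successful
goal = [1, 2, 3, 10, 12, 15]
state = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(goal, state)
print(threshold, impurity)
```

Here the threshold 6.5 separates the two classes perfectly, so both child nodes are pure and the weighted impurity is 0.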

## Bagging

Bagging is a method that consists in fitting multiple models of the same kind on different data samples created by bootstrap, then combining their predictions.  
In this case I'm using bagging on both **Naive Bayes** (the first code block) and **Decision Tree**. This technique is mainly used to improve a tree's performance by reducing its instability.
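
The idea can be sketched in a few lines of plain Python: draw bootstrap samples, fit one weak model per sample, and combine the predictions by majority vote. The one-feature "stump" learner, the function names and the toy data are hypothetical and chosen only for illustration; `BaggingClassifier` is more general (for instance it can also subsample features, as `max_features` does in the cells below).

```python
import random
from collections import Counter

def bootstrap_sample(X, y, rng):
    """Draw len(X) observations with replacement."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def fit_stump(X, y):
    """Hypothetical weak learner: threshold halfway between the two class means."""
    zeros = [x for x, t in zip(X, y) if t == 0]
    ones = [x for x, t in zip(X, y) if t == 1]
    if not zeros or not ones:
        only = y[0]  # degenerate sample containing a single class
        return lambda x: only
    mean0 = sum(zeros) / len(zeros)
    mean1 = sum(ones) / len(ones)
    threshold = (mean0 + mean1) / 2
    high_class = 1 if mean1 > mean0 else 0
    return lambda x: high_class if x > threshold else 1 - high_class

def bagging_fit(X, y, n_estimators=25, seed=0):
    """Fit one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(*bootstrap_sample(X, y, rng)) for _ in range(n_estimators)]

def bagging_predict(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): one feature, two well-separated classes
X_toy = [1.0, 1.5, 2.0, 2.5, 7.0, 7.5, 8.0, 8.5]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]

models = bagging_fit(X_toy, y_toy)
print(bagging_predict(models, 1.2), bagging_predict(models, 8.1))
```

Each stump sees a slightly different sample and therefore learns a slightly different threshold; the vote averages out these differences, which is why bagging helps unstable learners such as trees.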

#### Naive Bayes Bagging

In [None]:
from sklearn.ensemble import BaggingClassifier
Bagging = BaggingClassifier(gnb,max_samples=0.5, max_features=0.5)
Bagging_pred = Bagging.fit(X_train, Y_train).predict(X_test)
print("Bagging:Number of mislabeled points out of a total %d points : %d"
  % (X_test.shape[0], (Y_test != Bagging_pred).sum()))

#### Decision Tree Bagging

In [None]:
from sklearn.ensemble import BaggingClassifier
Bagging_Tree = BaggingClassifier(DecTree,max_samples=0.5, max_features=0.6)
Bagging_Tree_pred = Bagging_Tree.fit(X_train, Y_train).predict(X_test)
print("Tree Bagging: Number of mislabeled points out of a total %d points : %d"
  % (X_test.shape[0], (Y_test != Bagging_Tree_pred).sum()))

## SVM

Support Vector Machines are a family of very powerful algorithms that are difficult to interpret.  
To classify, the algorithm uses a hyperplane: it divides the space in which the observations are represented and classifies each observation using its signed distance from the hyperplane. 

In [None]:
from sklearn import svm
SVM = svm.SVC()
SVM_pred = SVM.fit(X_train, Y_train).predict(X_test)
print("SVM:Number of mislabeled points out of a total %d points : %d"
  % (X_test.shape[0], (Y_test != SVM_pred).sum()))
print('Plotting ROC curves and computing AUC.')
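
For the linear case, the role of the hyperplane can be sketched as follows: the sign of `w·x + b` gives the class and its absolute value divided by the norm of `w` gives the distance from the hyperplane. The weights `w` and the intercept `b` are made-up values, not parameters learned from the Kickstarter data, and `svm.SVC()` defaults to an RBF kernel, where the hyperplane lives in an implicit feature space rather than in the original one.

```python
import math

def decision_function(w, b, x):
    """Value of w·x + b: positive on one side of the hyperplane, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Assign class 1 on the non-negative side, class 0 otherwise."""
    return 1 if decision_function(w, b, x) >= 0 else 0

def distance_to_hyperplane(w, b, x):
    """Geometric distance between x and the hyperplane w·x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_function(w, b, x)) / norm

# Hypothetical hyperplane in two dimensions: 3*x1 + 4*x2 - 5 = 0
w = [3.0, 4.0]
b = -5.0

for point in ([3.0, 4.0], [0.0, 0.0]):
    print(point, classify(w, b, point), distance_to_hyperplane(w, b, point))
```

Points far from the hyperplane are classified with a wide margin, while points close to it are the uncertain ones.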

# ROC Curves

The ROC curve is a plot of the True Positive Rate against the False Positive Rate for all possible classification thresholds. From this plot you can see whether one model is better than the others at every threshold, or whether it can be improved.

In [None]:
from sklearn.metrics import RocCurveDisplay
import matplotlib.pyplot as plt
# RocCurveDisplay.from_estimator replaces plot_roc_curve, removed in scikit-learn 1.2
# alpha is the line transparency and must lie between 0 and 1
ax = plt.gca()
NB_disp = RocCurveDisplay.from_estimator(gnb, X_test, Y_test, ax=ax, alpha=0.8)
Tree_disp = RocCurveDisplay.from_estimator(DecTree, X_test, Y_test, ax=ax, alpha=0.8)
Bagging_Tree_disp = RocCurveDisplay.from_estimator(Bagging_Tree, X_test, Y_test, ax=ax, alpha=0.8)
Bagging_disp = RocCurveDisplay.from_estimator(Bagging, X_test, Y_test, ax=ax, alpha=0.8)
SVM_disp = RocCurveDisplay.from_estimator(SVM, X_test, Y_test, ax=ax, alpha=0.8)
plt.show()
print('Precision recall Curve')
from sklearn.metrics import PrecisionRecallDisplay
# likewise, PrecisionRecallDisplay.from_estimator replaces plot_precision_recall_curve
disp = PrecisionRecallDisplay.from_estimator(Bagging_Tree, X_test, Y_test)
plt.show()

Usually a good classification model has an AUC (Area Under the Curve, computed as the area under the ROC curve) higher than 0.80. Graphically, a good AUC corresponds to a curve that lies well above the diagonal of the plot, which represents a random classifier (AUC = 0.5).  
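
To show where the curve and the AUC come from, the sketch below builds the ROC points by lowering the threshold through each distinct score and then integrates them with the trapezoidal rule. The labels, the scores and the function names `roc_points` and `auc` are assumptions made for the example; in practice the scikit-learn functions do this work.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold
    through each distinct score, starting from (0, 0)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for p, t in zip(predicted, y_true) if p == 1 and t == 1)
        fp = sum(1 for p, t in zip(predicted, y_true) if p == 1 and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy example (hypothetical): true labels and predicted scores for class 1
y_toy = [1, 1, 0, 1, 0, 0]
scores_toy = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

roc = roc_points(y_toy, scores_toy)
area = auc(roc)
print(roc)
print(round(area, 3))
```

On this toy example the AUC is 8/9, about 0.889: out of the 9 possible (positive, negative) pairs, the positive observation receives the higher score in 8.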