In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC 

In [None]:
train = pd.read_csv("https://raw.githubusercontent.com/aiwei/inst414-21s/main/kaggle/train.csv")

In [None]:
train.head(5)

Unnamed: 0,label,text
0,1,"Henry Thomas showed a restraint, even when the..."
1,1,"This movie starts out brisk, has some slow mom..."
2,1,Castle of Blood is a good example of the quali...
3,1,I viewed the movie together with a homophobic ...
4,1,"The ""Men in White"" movie is definitely one of ..."


In [None]:
test = pd.read_csv("https://raw.githubusercontent.com/aiwei/inst414-21s/main/kaggle/test.csv")

In [None]:
test.head(5)

Unnamed: 0,Id,text
0,0,I cannot believe I actually sat through the wh...
1,1,I saw this one remastered on DVD. It had a big...
2,2,"Irrespective of the accuracy of facts, Bandit ..."
3,3,"Significant Spoilers! This is a sick, disturbi..."
4,4,If there are people that don't like this movie...


**TfidfVectorizer:** converts each movie review into a vector of TF-IDF weights, so terms that are frequent in one review but rare across the corpus count for more.

**Parameters:**
- **sublinear_tf = True**: applies sublinear tf scaling, replacing tf with 1 + log(tf).
- **analyzer = 'word'**: builds features from word n-grams rather than character n-grams.
- **lowercase = True**: converts all characters to lowercase before tokenizing.
- **max_df = .7**: ignores terms that occur in more than 70% of the documents.
- **ngram_range = (1,3)**: features include unigrams, bigrams, and trigrams.

In [None]:
tfid_vect = TfidfVectorizer(sublinear_tf=True, analyzer='word', lowercase=True, max_df=.7, ngram_range=(1, 3))
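
To make the effect of `max_df` and `ngram_range` concrete, here is a minimal sketch on a hypothetical four-sentence corpus (the sentences and the `toy_*` names are made up for illustration and are not part of the Kaggle data). It uses the same vectorizer settings as above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the Kaggle reviews.
toy_docs = [
    "the movie was good",
    "the movie was bad",
    "the acting was good",
    "a slow plot",
]

toy_vect = TfidfVectorizer(sublinear_tf=True, analyzer="word", lowercase=True,
                           max_df=0.7, ngram_range=(1, 3))
toy_X = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each kept term (unigram, bigram, or trigram) to a column.
toy_vocab = sorted(toy_vect.vocabulary_)

# "the" and "was" appear in 3 of 4 documents (75% > 70%), so max_df drops
# them as unigrams. N-grams that contain them, such as "the movie", are
# counted as separate terms and are kept if their own frequency is low enough.
print("the" in toy_vocab, "the movie" in toy_vocab)
print(toy_X.shape)  # (number of documents, number of kept terms)
```

Note that `max_df` filters each n-gram by its own document frequency, so a common word can still contribute through the bigrams and trigrams it appears in.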

In [None]:
X_train = tfid_vect.fit_transform(train.text)

In [None]:
X_test = tfid_vect.transform(test.text)

In [None]:
y_train = train.label

**The following line splits the training data a second time (train_test_split holds out 25% by default), giving us a labeled validation set for parameter tuning and feature selection.**


In [None]:
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train, y_train, random_state = 5)
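
As a quick check of what this split does, the sketch below runs `train_test_split` on a hypothetical 100-row array (`X_demo` and `y_demo` are made-up stand-ins, not the review data). With no `test_size` given, 25% of the rows are held out, and a fixed `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 2 features, balanced binary labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

# Same call shape as in the notebook: default test_size, fixed random_state.
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=5)
print(X_tr.shape, X_val.shape)

# Repeating the call with the same random_state yields the same split.
X_tr_again, X_val_again, _, _ = train_test_split(X_demo, y_demo, random_state=5)
print(np.array_equal(X_val, X_val_again))
```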

**Linear SVC Parameters:**
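- **C = 1e20:** sets the regularization parameter C to 1e20. Regularization strength is inversely proportional to C, so a value this large effectively disables regularization and yields a smaller-margin hyperplane that tries to classify every training point correctly.
- **max_iter = 10000:** sets the maximum number of iterations to 10,000.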

**Testing our model with our new training and testing sets.**

**Note:** The accuracy score projected here differs slightly from the accuracy score on Kaggle. Because of the second train-test split, the model is trained on fewer reviews and scored on a held-out portion of the training data rather than on the Kaggle test set, so the two scores are not directly comparable. I used this score to see whether different parameters increased or decreased the projected score.

In [None]:
svc_test = LinearSVC(C=1e20,max_iter=10000).fit(X_train_2, y_train_2)
svc_test_score = svc_test.score(X_test_2,y_test_2)
print('Projected Accuracy:',svc_test_score)

Projected Accuracy: 0.8916
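
The parameter comparison described in the note can be sketched as a small loop that refits the model for several values of C and records the held-out accuracy for each. This is only an illustration: the reviews, labels, candidate C values, and the `demo_*` names are hypothetical, and the toy data is far easier than the real reviews, so the scores it prints say nothing about the Kaggle task.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Hypothetical toy reviews standing in for the Kaggle training data.
demo_texts = [
    "great fun film",
    "loved this great movie",
    "awful boring film",
    "hated this dull movie",
] * 10
demo_labels = [1, 1, 0, 0] * 10

# Vectorize first, then split, mirroring the order used in this notebook.
demo_X = TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 3)).fit_transform(demo_texts)
X_a, X_b, y_a, y_b = train_test_split(demo_X, demo_labels, random_state=5)

# Fit one model per candidate C and keep its held-out accuracy.
demo_scores = {}
for c in [0.01, 1.0, 100.0]:
    model = LinearSVC(C=c, max_iter=10000).fit(X_a, y_a)
    demo_scores[c] = model.score(X_b, y_b)

for c, score in demo_scores.items():
    print("C =", c, "held-out accuracy:", score)
```

Comparing the printed scores side by side is the same procedure the note describes, just automated over a list of candidate values.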


**Final Linear SVC model trained on the full training data. It creates predictions for our X_test.**

In [None]:
final_svc = LinearSVC(C=1e20,max_iter=10000).fit(X_train,y_train)
svc_prediction = final_svc.predict(X_test)

**Writes our predictions into a pandas DataFrame and presents the first 10 rows.**

In [None]:
prediction_df = pd.DataFrame({"Id": test.Id, "Category": svc_prediction})

In [None]:
prediction_df.head(10)

Unnamed: 0,Id,Category
0,0,0
1,1,0
2,2,1
3,3,0
4,4,1
5,5,0
6,6,1
7,7,0
8,8,1
9,9,1


**Writes our pandas DataFrame to a CSV file named "Dillon_Morley_Final.csv".**

In [None]:
prediction_df.to_csv("Dillon_Morley_Final.csv", index=False)