## Multi-Class Classifier 

The following code builds a multiclass classifier using logistic regression. The model is trained on the features and labels in "train.csv". It is then given new text from "test.csv" and predicts a label for each datapoint, indicating whether that datapoint is (0) not a movie or TV show review, (1) a positive movie or TV show review, or (2) a negative movie or TV show review. To run the code, replace the placeholder paths with the paths to your own .csv files.

## Necessary Imports

In [16]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

## Reading in Training Data

First, we'll need to read in and analyze the training data. 

In [17]:
columns = ['ID', 'TEXT', 'LABEL']
data = pd.read_csv(r"path_to_your_csv_here",header=0,names=columns)

#to view the data format, you can uncomment the below:
#data.head()
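
Before training, it can help to confirm that the file parsed as expected. The following is a minimal sketch, assuming a small in-memory CSV: the rows are hypothetical stand-ins for train.csv, and `sample_csv` and `sample` are illustrative names that do not appear in the notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for train.csv; the real file is read from disk instead.
sample_csv = io.StringIO(
    "id,text,label\n"
    "1,Loved this movie,1\n"
    "2,Terrible show,2\n"
    "3,,0\n"
)

columns = ['ID', 'TEXT', 'LABEL']
# header=0 discards the file's own header row; names= replaces it with ours.
sample = pd.read_csv(sample_csv, header=0, names=columns)

print(sample.shape)                    # (rows, columns)
print(sample.isna().sum())             # missing values per column
print(sample['LABEL'].value_counts())  # class balance
```

An empty text field is read as a missing value (NaN), which is why the text column is converted to strings before it is vectorized.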

## Training the Model

Second, we'll need to transform our 'TEXT' column from train.csv into a feature vector. Similarly, we need to convert the 'LABEL' column into a label vector. Once those vectors have been created, we can train our model using those vectors and adjust our LogisticRegression parameters as necessary.  

In [3]:
#creating features from the text
#every entry is cast to a string first, so missing values become the text 'nan'
cv = CountVectorizer()
x = cv.fit_transform(data['TEXT'].apply(str))
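
To make the feature vector concrete, here is a small sketch of what `CountVectorizer` produces with its default settings. The two texts and the `toy_` names are hypothetical and used only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_texts = ["Great movie", "Bad movie"]  # hypothetical examples

toy_cv = CountVectorizer()
toy_x = toy_cv.fit_transform(toy_texts)

# vocabulary_ maps each lowercased token to its column index.
print(sorted(toy_cv.vocabulary_.items(), key=lambda kv: kv[1]))
# Each row is a document; each column counts one vocabulary word.
print(toy_x.toarray())
```

Each text becomes a row of word counts, so "Great movie" has a 1 in the "great" and "movie" columns and a 0 in the "bad" column.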

In [4]:
#encoding the 'LABEL' column as integer class labels
le = LabelEncoder()
y = le.fit_transform(data['LABEL'])
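
As a sketch of what `LabelEncoder` does, the example below encodes hypothetical string labels and decodes them again. The label names and the `toy_` variables are assumptions for illustration, not values from train.csv.

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["positive", "negative", "not_review", "positive"]  # hypothetical

toy_le = LabelEncoder()
toy_y = toy_le.fit_transform(toy_labels)

print(list(toy_le.classes_))                    # sorted unique labels; position = encoded value
print(list(toy_y))                              # the encoded label vector
print(list(toy_le.inverse_transform([2, 0])))   # map encoded values back to the originals
```

If the 'LABEL' column already holds the integers 0, 1, and 2, the encoder maps each value to itself, so the encoded vector matches the original column.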

In [5]:
#training the model
LR = LogisticRegression(max_iter=10000, random_state=42)
model = LR.fit(x,y)
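
One way to decide whether the LogisticRegression parameters need adjusting is to hold out part of the training data and score the model on it. This is a minimal sketch under stated assumptions: the texts and labels are hypothetical toy data, and the hold-out split is a common technique swapped in for illustration, not a step taken from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 0 = not a review, 1 = positive, 2 = negative.
texts = [
    "the train leaves at noon", "please buy milk and eggs",
    "the meeting moved to friday", "rain is expected tomorrow",
    "loved this movie", "a wonderful show with great acting",
    "great film and a great cast", "best series this year",
    "hated this movie", "a boring show with awful acting",
    "terrible film and a weak cast", "worst series this year",
]
labels = [0] * 4 + [1] * 4 + [2] * 4

# Keep 25% of the rows aside; stratify keeps the class proportions equal.
train_texts, val_texts, train_y, val_y = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Fit the vocabulary on the training split only, then reuse it.
val_cv = CountVectorizer()
train_x = val_cv.fit_transform(train_texts)
val_x = val_cv.transform(val_texts)

val_model = LogisticRegression(max_iter=10000, random_state=42)
val_model.fit(train_x, train_y)

val_accuracy = accuracy_score(val_y, val_model.predict(val_x))
print(val_accuracy)
```

Comparing this score across parameter settings (for example, different values of `C`) gives a rough signal of which setting generalizes better. On data this small the number is too noisy to mean much.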

## Adding in Test Data

Now that our model has been fit, we can prepare and analyze the test data. Once we have confirmed the integrity of the test data, we can create a new feature vector for the 'TEXT' column to be fed into the fitted model. 

In [9]:
data2 = pd.read_csv(r"path_to_your_csv_here")

#to view the data, you can uncomment the below:
#data2.head()

In [10]:
#creating features from test data
#reuse the vectorizer fitted on the training data so the columns match what the model was trained on
x2 = cv.transform(data2['TEXT'].apply(str))
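
The test text has to be transformed with the vectorizer that was fitted on the training text. The sketch below illustrates why, using hypothetical example sentences and illustrative variable names.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["great movie", "bad movie"]  # hypothetical training text
test_docs = ["a great new tv show"]        # hypothetical unseen text

# Reusing the fitted vectorizer keeps the training vocabulary.
fitted_cv = CountVectorizer().fit(train_docs)
reused = fitted_cv.transform(test_docs)

# Fitting a fresh vectorizer on the test text builds a different vocabulary.
refit = CountVectorizer().fit_transform(test_docs)

print(reused.shape)      # columns match the training vocabulary
print(reused.toarray())  # only "great" is counted; unseen words are dropped
print(refit.shape)       # a different column count, incompatible with the model
```

A model trained on the three training columns cannot score a matrix with a different number of columns. Even when the counts happen to match, the columns would stand for different words.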

## Predictions
Now we can create our predictions for the new feature vector for the test data!

In [18]:
predictions = LR.predict(x2)

#to confirm the shape and length of the vector, you can run the following print statements:
#print(predictions)
#print(len(predictions))
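
Beyond the predicted labels, `predict_proba` reports how confident the model is in each class. The following sketch assumes hypothetical toy data and illustrative names (`p_cv`, `p_model`); it shows the shape of the output and does not reproduce the notebook's results.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 0 = not a review, 1 = positive, 2 = negative.
toy_train = ["the bus is late", "buy milk today",
             "loved this movie", "great show",
             "hated this movie", "awful show"]
toy_train_y = [0, 0, 1, 1, 2, 2]

p_cv = CountVectorizer()
p_model = LogisticRegression(max_iter=10000, random_state=42)
p_model.fit(p_cv.fit_transform(toy_train), toy_train_y)

new_x = p_cv.transform(["great movie", "the bus was awful"])
probabilities = p_model.predict_proba(new_x)  # one row per text, one column per class
predicted = p_model.predict(new_x)            # the most probable class per row

print(p_model.classes_)                # column order of predict_proba
print(np.round(probabilities, 2))
print(predicted)
```

Each row of probabilities sums to 1, and the predicted label is the class with the largest value in that row.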

## Outputting Our Results 

In order to make our results available to others, we can create a new .csv through the pandas module. In this case, since we are interested in the test.csv 'ID' column and the predictions for those IDs, we just need to create those two columns.  

In [19]:
#creating the final .csv
final_csv = pd.DataFrame({"ID":list(data2["ID"]), "LABEL":list(predictions)})

#writing the .csv to disk for saving and sharing outside of Jupyter
final_csv.to_csv(r"path_you_would_like_to_export_to",index=False)
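
A quick way to confirm the export is to read the file back and compare it with the DataFrame that was written. This is a small sketch, assuming hypothetical IDs and predictions and a temporary directory in place of the real export path.

```python
import os
import tempfile

import pandas as pd

ids = [101, 102, 103]  # hypothetical test IDs
preds = [1, 0, 2]      # hypothetical predicted labels

# The two columns must line up one-to-one before they are combined.
assert len(ids) == len(preds)

out = pd.DataFrame({"ID": ids, "LABEL": preds})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "predictions.csv")  # illustrative file name
    out.to_csv(out_path, index=False)  # index=False omits the row-number column
    check = pd.read_csv(out_path)

print(check.equals(out))
```

Without `index=False`, pandas writes the row index as an extra unnamed first column, which a reader expecting only 'ID' and 'LABEL' would have to drop.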