# Part 2: Logistic Regression

### Dataset
We will use the airline reviews dataset from https://github.com/quankiquanki/skytrax-reviews-dataset

In [None]:
import pandas as pd
df = pd.read_csv('data/airline.csv')
print(df.dtypes)
df.head()

The following cell prints the total number of rows and the number of non-null values in each column, which shows how much data is missing.

In [None]:
print("All rows: ", len(df))
print(df.notnull().sum(axis=0))

### Model
We want to build a model that classifies an airline review as positive or negative based only on its content.

To do that, we need to extract the feature data and class labels from the dataset.

In [None]:
y = df['recommended']
X = df['content']

A logistic regression classifier requires numerical features, so we must transform the review content into a numerical representation. We can use a bag-of-words representation: each review is transformed into a numerical vector, and each element of the vector indicates whether the word associated with that element is present in the review.

Let's say our dataset has three sentences:
* I am Pawel.
* He likes Python.
* Students learn a lot.

Then we can represent the vocabulary of this dataset as the following vector:
```
[I, am, Pawel, He, likes, Python, Students, learn, a, lot]
```
Every sentence can then be transformed into a vector whose elements are set to 1 if the corresponding word is present in the sentence (no matter how many times) and to 0 otherwise.

For example:
```
I am Pawel => [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
Students learn Python => [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
```
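The encoding above can be sketched in a few lines of plain Python. This is only an illustration of the idea: the helper name `to_binary_vector` and the naive whitespace tokenization are assumptions made for this example, not how scikit-learn tokenizes text.

```python
import string

# Vocabulary taken from the three example sentences above.
VOCABULARY = ["I", "am", "Pawel", "He", "likes", "Python",
              "Students", "learn", "a", "lot"]


def to_binary_vector(sentence, vocabulary=VOCABULARY):
    """Return 1 for each vocabulary word present in the sentence, else 0."""
    # Naive tokenization (an assumption): split on whitespace, strip punctuation.
    words = {token.strip(string.punctuation) for token in sentence.split()}
    return [1 if word in words else 0 for word in vocabulary]


print(to_binary_vector("I am Pawel"))             # [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print(to_binary_vector("Students learn Python"))  # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
```

Because the words are collected into a set, repeating a word in the sentence does not change the resulting vector.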

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)

With the data split into training and test sets, we create a `CountVectorizer` transformer and fit it on the training reviews only, so that the vocabulary is learned without looking at the test set.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True)
X_train_transformed = vectorizer.fit_transform(X_train)

Now we can use it to transform text into a _sparse_ vector of binary values.

In [None]:
print(vectorizer.transform(["He likes Python."]))
vectorizer.transform(["He likes Python."])

Next, we train a logistic regression model on the transformed data.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_transformed, y_train)
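For intuition about why the features must be numerical: a binary logistic regression model computes a weighted sum of the feature values plus a bias, and passes the result through the sigmoid function to obtain a probability. The sketch below illustrates this in plain Python. The three-word vocabulary, the weights, and the helper names `sigmoid` and `predict_probability` are made up for illustration and are not taken from the trained model.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of the positive class for one binary feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights for a made-up vocabulary ["awesome", "awful", "flight"];
# illustrative values only, not weights learned from the airline dataset.
weights = [2.0, -2.5, 0.1]
bias = 0.0

print(predict_probability([1, 0, 1], weights, bias))  # "awesome flight": above 0.5
print(predict_probability([0, 1, 1], weights, bias))  # "awful flight": below 0.5
```

Training amounts to finding the weights and bias that make these probabilities agree with the labels in the training data as closely as possible.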

To evaluate the trained model, we transform the test data with the same vectorizer and compute the score.

In [None]:
X_test_transformed = vectorizer.transform(X_test)
print("Score: ", model.score(X_test_transformed, y_test))
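For classifiers, `score` returns the mean accuracy: the fraction of test examples whose predicted label equals the true label. A minimal sketch of that computation follows; the function name `accuracy` and the labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels: three of the four predictions are correct.
example_true = [1, 0, 1, 1]
example_pred = [1, 0, 0, 1]
print(accuracy(example_true, example_pred))  # 0.75
```

Accuracy alone can be misleading when one class is much more frequent than the other, so it is worth checking the class balance of `y` when interpreting the score.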

Now we can use the model to classify new reviews. (What are the problems with this model? How can we try to fix them?)

In [None]:
def predict(text_content):
    # Transform the raw text with the fitted vectorizer, then classify it.
    vector = vectorizer.transform([text_content])
    label = "positive" if model.predict(vector)[0] else "negative"
    return text_content, label, model.predict_proba(vector)

print(predict("The flight was quite awesome"))
print(predict("It was expensive and drinks were awful"))
print(predict("The stewardess spilled champagne all over me!"))
print(predict("lol"))
print(predict("Rewelacja. Najwspanialszy lot mojego życia!"))
print(predict("Rewelacja. Najwspanialsza podróż mojego życia!"))
print(predict("Dramat. Nigdy więcej nie polecę tymi liniami! :("))

Instead of single words, we can use n-grams: sequences of n consecutive words that preserve the order of the source text. This approach allows the model to recognize more complex phrases, but it is more computationally intensive because the vocabulary grows considerably.
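To make the idea concrete, here is a small sketch that extracts word n-grams from a sentence. The helper name `word_ngrams` and the lowercase-and-split tokenization are assumptions made for this example; `CountVectorizer` uses its own tokenizer.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, in source order."""
    tokens = text.lower().split()  # naive tokenization, for illustration only
    ngrams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


print(word_ngrams("drinks were not awful"))
# ['drinks', 'were', 'not', 'awful', 'drinks were', 'were not', 'not awful']
```

The bigram "not awful" is a feature that a single-word model cannot represent: it sees only "not" and "awful" separately.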

Below we recreate the transformer and the model using both single words and bigrams (two-word sequences), as set by `ngram_range=(1, 2)`. You should notice that training the model takes more time than before.

In [None]:
vectorizer_bigrams = CountVectorizer(binary=True, ngram_range=(1, 2))
X_train_transformed_b = vectorizer_bigrams.fit_transform(X_train)
model_bigrams = LogisticRegression()
model_bigrams.fit(X_train_transformed_b, y_train)
X_test_transformed_b = vectorizer_bigrams.transform(X_test)
print("Score: ", model_bigrams.score(X_test_transformed_b, y_test))

The trained model and transformer can then be serialized and saved to files using pickle, Python's serialization module. This way they can be used outside of this session to classify new reviews without access to the training data.

Keep in mind that the model may not load properly from the file if you use a different Python or scikit-learn version.

In [None]:
import pickle

MODEL_PATH = "log_reg_model.pkl"
VECTORIZER_PATH = "vectorizer.pkl"

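# Use context managers so the files are flushed and closed after writing.
with open(VECTORIZER_PATH, 'wb') as f:
    pickle.dump(vectorizer_bigrams, f)
with open(MODEL_PATH, 'wb') as f:
    pickle.dump(model_bigrams, f)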

Now we load the model and the vectorizer from the files and check that we get the same score.

In [None]:
# Use context managers so the files are closed after reading.
with open(VECTORIZER_PATH, 'rb') as f:
    loaded_vectorizer = pickle.load(f)
with open(MODEL_PATH, 'rb') as f:
    loaded_model = pickle.load(f)

X_test_transformed_b_new = loaded_vectorizer.transform(X_test)
print("Score: ", loaded_model.score(X_test_transformed_b_new, y_test))
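One possible way to guard against the version problem mentioned earlier is to store version information next to the pickled object and compare it before unpickling. The sketch below is an assumption-laden illustration, not part of pickle or scikit-learn: the helper names `save_with_versions` and `load_with_versions` and the bundle layout are hypothetical.

```python
import pickle
import platform


def save_with_versions(obj, path, library_versions):
    """Pickle obj together with the Python and library versions used."""
    bundle = {
        "python_version": platform.python_version(),
        "library_versions": dict(library_versions),
        # The object is pickled separately so that versions can be
        # inspected on load before the object itself is unpickled.
        "payload": pickle.dumps(obj),
    }
    with open(path, "wb") as f:
        pickle.dump(bundle, f)


def load_with_versions(path, library_versions, strict=False):
    """Return (obj, mismatches); raise on mismatches if strict is True."""
    with open(path, "rb") as f:
        bundle = pickle.load(f)
    mismatches = []
    current_python = platform.python_version()
    if bundle["python_version"] != current_python:
        mismatches.append(("python", bundle["python_version"], current_python))
    for name, current in library_versions.items():
        saved = bundle["library_versions"].get(name)
        if saved != current:
            mismatches.append((name, saved, current))
    if strict and mismatches:
        raise RuntimeError(f"Version mismatch: {mismatches}")
    return pickle.loads(bundle["payload"]), mismatches
```

In this notebook it could be called as `save_with_versions(model_bigrams, MODEL_PATH, {"scikit-learn": sklearn.__version__})` after `import sklearn`, passing the same dictionary to `load_with_versions` when loading.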