
# Introduction

In this section, we will focus on getting to know our data. This is important for two key reasons:

1. Looking closely at our data helps us find patterns, spot missing information, and check if some parts of the data are uneven. Fixing these issues is important because it ensures we get accurate and useful results. By using good quality data, we can give Delta Airlines helpful insights that they can use to make better decisions.

2. When sharing your results, it’s important to explain where your data comes from. Giving a clear picture of the data you used helps others trust your findings and understand them better.

### So far, you've completed

#### Data Loading



Using a library called pandas, we will load the dataset from a public link hosted on GitHub.

In [None]:
import pandas as pd

In [None]:
data = pd.read_csv('https://gist.githubusercontent.com/almagashi/e8d9e1539069115e00a5a7246fc5cb54/raw/00156b710c9659c59470e77755f26d97e64425f1/airline_data.csv')

Now that our data is loaded, let's take a look at it. The pandas method `.head()` shows the first five rows of the dataset.

In [None]:
data.head()

To understand better what columns we have data for, let's print out only the column names.

In [None]:
data.columns

Now that we know the columns we're working with, let's dive deeper and look at basic statistics about the values these columns contain, specifically the numerical columns.

This step helps us understand the amount and range of data we have, and helps us spot outliers (anomalies) in our data. In this case, we will move forward with the code, as nothing seems out of order.

In [None]:
data.describe()

#### Missing Data and Duplicate Data

Let's take a look at the missing data. Missing data refers to rows or columns that lack any (readable) value.

Addressing missing values is a crucial step in the sentiment analysis pipeline, as it can impact the quality of our analysis. We can handle missing data by either removing the incomplete entries or replacing them using a systematic approach, such as imputation.
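
As a minimal sketch of these two options, here is a small made-up DataFrame (the column names and values are illustrative only and are not taken from the airline dataset). It drops incomplete rows in one copy and fills a numeric gap with the column mean in another:

```python
import pandas as pd

# toy data with gaps; all values are made up for illustration
toy = pd.DataFrame({
    "text": ["late flight", "great crew", None, "lost bag"],
    "confidence": [0.9, None, 0.7, 0.8],
})

# option 1: remove every row that contains a missing value
dropped = toy.dropna()

# option 2: impute, here by filling numeric gaps with the column mean
imputed = toy.copy()
imputed["confidence"] = imputed["confidence"].fillna(imputed["confidence"].mean())

print(len(toy), len(dropped))              # rows before and after dropping
print(imputed["confidence"].isna().sum())  # gaps left in the imputed column
```

Dropping is simpler but loses rows; imputation keeps them at the cost of introducing estimated values.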

In [None]:
# percentage of missing values per column
# (isnull() and isna() are aliases, so one call is enough)
print("Percentage of null/NaN values per column:")
(data.isna().mean() * 100).round(2)

Based on the amount of missing values, and the information the columns contain, we will choose to remove a few irrelevant or useless columns.

In [None]:
# dropping irrelevant columns

data = data.drop(['_unit_id', '_golden', '_unit_state', '_trusted_judgments',
       '_last_judgment_at', 'airline_sentiment_gold', 'name', 'retweet_count',
          'negativereason_gold','tweet_coord',
          'tweet_location','user_timezone'], axis=1)

In [None]:
data.head()

Sometimes data contains duplicate rows because of collection errors or human error (e.g. submitting the same review twice). These duplicates should not be reflected in our final analysis, so we will check whether any exist and remove them.
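
To make the idea concrete, here is a small sketch on a made-up DataFrame (the rows are invented for illustration). It shows how pandas counts and removes rows that exactly repeat an earlier row:

```python
import pandas as pd

# toy reviews; the third row is an exact copy of the second
toy = pd.DataFrame({
    "airline": ["Delta", "United", "United"],
    "text": ["great flight", "lost my bag", "lost my bag"],
})

n_dupes = toy.duplicated().sum()   # rows identical to an earlier row
deduped = toy.drop_duplicates()    # keeps the first occurrence of each row

print(n_dupes, len(deduped))
```

Note that `duplicated()` only flags rows where every column matches, so two reviews with the same text but different timestamps would not count as duplicates.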

In [None]:
# check duplicate values

print(f"Number of duplicate rows: {data.duplicated().sum()}")

In [None]:
# drop duplicate values

data = data.drop_duplicates()
print(f"Number of duplicate rows after dropping duplicates: {data.duplicated().sum()}")

Looks like there are no duplicate values! Let's move to distributions.

#### Data Distributions

After examining the data for outliers and missing values, we move on to understanding how our data is distributed.

For example, let's explore how many airlines we are analyzing, how many reviews we have for each airline, and how many of those reviews are negative. This will be important information when presenting your results, as your results will only be impactful when presented in context.

In [None]:
# count airline data

from matplotlib import pyplot as plt
import seaborn as sns
data.groupby('airline').size().plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right',]].set_visible(False)

In [None]:
# plot sentiment distribution

data['airline_sentiment'].value_counts().plot(kind='bar', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right',]].set_visible(False)

In [None]:
# show sentiment values per airline

a = data.groupby(['airline', 'airline_sentiment'])['airline_sentiment'].count().unstack().plot(kind='bar', stacked=False, color=sns.palettes.mpl_palette('Dark2'))

In [None]:
# show the distributions of negativereason

data['negativereason'].value_counts().plot(kind='bar', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right',]].set_visible(False)

Now let's see the top 3 complaint reasons for each airline.

In [None]:
# group the data by airline and negativereason, then count the values
airline_negativereason_counts = data.groupby(['airline', 'negativereason'])['negativereason'].count().unstack()

# for each airline, get the top 3 negative reasons
for airline in airline_negativereason_counts.index:
  top_3_reasons = airline_negativereason_counts.loc[airline].nlargest(3)

  # create a bar plot for the top 3 reasons for the current airline
  top_3_reasons.plot(kind='bar', color=sns.palettes.mpl_palette('Dark2'))
  plt.title(f'Top 3 Negative Reasons for {airline}')
  plt.xlabel('Negative Reason')
  plt.ylabel('Count')
  plt.gca().spines[['top', 'right',]].set_visible(False)
  plt.show()

#### Data Cleaning

We will apply a few standard steps to clean the tweets and extract the most meaning from them:
* remove all links
* keep only letters (this also drops emojis, digits, and punctuation)
* convert all letters to lowercase, and split sentences into words (tokens)
* define a set of common English stopwords such as "the", "at", and "is"
* keep only the words that are not in the stopword set, because these carry most of the meaning
* store the cleaned text in a new column called `clean_tweet`

In [None]:
# import and download necessary libraries
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# clean tweets

# build the stopword set once instead of on every call
stops = set(stopwords.words("english"))

def tweet_to_words(tweet):
    nolinks = re.sub(r"http\S+", "", tweet)            # remove links
    letters_only = re.sub("[^a-zA-Z]", " ", nolinks)   # keep only letters
    words = letters_only.lower().split()               # lowercase and tokenize
    meaningful_words = [w for w in words if w not in stops]  # drop stopwords
    return " ".join(meaningful_words)

In [None]:
# store clean tweets into new column called clean_tweet

data['clean_tweet'] = data['text'].apply(tweet_to_words)

Now let's peek at the data. We see that a new column has been added at the end with the clean text. This is the column we will use in our model.

In [None]:
data.head()

Let's observe how a clean tweet looks against its original counterpart.

In [None]:
print('original tweet:', data.text[75])
print('cleaned tweet:', data.clean_tweet[75])

We can see that the most important information has been preserved. We're ready to convert this data into numerical values.

#### Data Transformation

When we look back at the sentiment distribution, we see that most of our data belongs to the negative category. This class imbalance might create issues when training our models. Since the task is mainly to identify negative reviews, combining neutral and positive into one category might help in model training.

This is why we will convert all neutral and positive sentiment values into 1, representing positive sentiment and all negative sentiment values to 0.
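
The effect of this merge on class balance can be sketched with made-up labels (the counts below are invented and do not reflect the real dataset):

```python
import pandas as pd

# made-up labels: 6 negative, 2 neutral, 2 positive
labels = pd.Series(["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2)

# merge neutral and positive into a single class, 1
numeric = labels.map({"negative": 0, "neutral": 1, "positive": 1})

before = labels.value_counts(normalize=True).to_dict()
after = numeric.value_counts(normalize=True).to_dict()
print(before)
print(after)

# any label missing from the dictionary would become NaN, so check for that
print(numeric.isna().sum())
```

In this toy case the three-way split of 60/20/20 becomes a two-way split of 60/40, which is closer to balanced.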

In [None]:
data['sentiment_numeric'] = data['airline_sentiment'].map({'negative': 0, 'neutral': 1, 'positive': 1})

In [None]:
data.head()

#### Data Vectorization

To get ready for our machine learning algorithm, we need to split the data into two parts: training data and testing data.

* The training data is used to teach the algorithm by showing it many examples with labels. For example, we tell the algorithm that "awful experience" is negative and "super experience" is positive.

* The test data is used to see how well the algorithm has learned. We give it new examples, like "mindblowing experience," and see if it predicts the correct label. If the algorithm guesses wrong (like predicting negative when the correct label is positive), this mistake is counted. The percentage of wrong guesses tells us how well the algorithm is performing. We will use these sets to evaluate how our algorithm is performing.
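
As a rough illustration of the split itself, the sketch below uses ten made-up examples (the texts and labels are placeholders) and scikit-learn's `train_test_split` with an 80/20 ratio:

```python
from sklearn.model_selection import train_test_split

texts = [f"review {i}" for i in range(10)]   # made-up inputs
labels = [0, 1] * 5                          # made-up labels

tr_X, te_X, tr_y, te_y = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

print(len(tr_X), len(te_X))    # sizes of the two sets
print(set(tr_X) & set(te_X))   # examples that appear in both sets
```

Fixing `random_state` makes the random split reproducible, so rerunning the notebook gives the same train and test sets.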

We will fit the vectorizer on the training set only and then use it to transform both sets. This keeps information from the test data out of training, so the test data is treated as new, unseen data.

After all, we want the algorithm to assign the correct label to any future tweet.
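
Here is a small sketch of what keeping the test data unseen means in practice, using the example phrases from above as a toy corpus. The vectorizer learns its vocabulary from the training text only, so a word that appears only in the test text is ignored when the test text is transformed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["awful experience", "super experience"]
test_docs = ["mindblowing experience"]

vec = TfidfVectorizer()
vec.fit(train_docs)                      # vocabulary comes from training text only
test_matrix = vec.transform(test_docs)   # words not in the vocabulary are skipped

print(sorted(vec.vocabulary_))   # ['awful', 'experience', 'super']
print(test_matrix.shape)         # (1, 3)
print(test_matrix.nnz)           # 1, only "experience" is recognized
```

The test matrix has the same number of columns as the training vocabulary, which is what lets a model trained on one be applied to the other.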

In [None]:
X = data['clean_tweet'] # the data we will feed (input)
y = data['sentiment_numeric'] # the labels it will learn against (output)

In [None]:
from sklearn.model_selection import train_test_split

# split the data into 80% training, 20% testing randomly

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Now we will fit the vectorizer on the training data and use it to transform both the train and test data.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf.fit(X_train)

In [None]:
X_train_tfidf = tfidf.transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

After the vectorizer turns the text into vectors, each dataset should be a matrix. Let's check that this is the format our data is in right now:

In [None]:
X_train_tfidf

In [None]:
X_test_tfidf

#### Choosing models

In [None]:
# import the chosen models

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

#### Training the models

In [None]:
# activate and train Logistic Regression Model

log = LogisticRegression(max_iter=1000)
log.fit(X_train_tfidf,y_train)

We know the logistic model has run when the output shows `LogisticRegression(max_iter=1000)`.

In [None]:
# activate and train Linear Support Vector Machine Model

svc = LinearSVC()
svc.fit(X_train_tfidf,y_train)

In [None]:
# activate and train Multinomial Naive Bayes Model

nb = MultinomialNB()
nb.fit(X_train_tfidf,y_train)

Now that all models have run and trained successfully, let's move to testing and evaluating their performance.

#### Testing the models

In [None]:
preds_log = log.predict(X_test_tfidf)
preds_svc = svc.predict(X_test_tfidf)
preds_nb = nb.predict(X_test_tfidf)

#### Metrics of Evaluation

Let's begin with the most common, accuracy.

**Accuracy**

As a minimum benchmark, a model is doing well when it classifies better than chance. If we randomly guessed whether a review is positive or negative, we would be right about 50% of the time, so accuracy above 50% is the minimum requirement. Because our classes are imbalanced, a stricter baseline is the accuracy of always predicting the most common class.
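
One caveat about the chance baseline: when one class is more common than the other, a model that always predicts the most common class already scores above 50%. The sketch below computes that majority-class baseline on made-up labels (the 7-to-3 split is invented for illustration):

```python
from collections import Counter

# made-up labels: 0 = negative, 1 = non-negative
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# accuracy of a "model" that always predicts the most common label
majority_label, majority_count = Counter(y_true).most_common(1)[0]
baseline_accuracy = majority_count / len(y_true)

print(majority_label, baseline_accuracy)   # 0 0.7
```

A trained model is only adding value if its accuracy is clearly above this number.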

However, depending on the problem at hand, we need to consider a few more metrics. If the classifier were tasked with classifying whether patients have cancer, we would need much higher accuracy and would also need to make sure the classifier doesn't make critical mistakes. This brings us to two important types of errors: Type I and Type II errors, which are especially important in problems like diagnosing cancer.

* **Type I Error** (False Positive): This occurs when the model predicts something as positive when it’s actually negative. For example, predicting a patient has cancer when they do not.
* **Type II Error** (False Negative): This happens when the model predicts something as negative when it’s actually positive. For instance, predicting a patient does not have cancer when they actually do.
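
The two error types can be counted directly from true and predicted labels. The sketch below uses made-up labels, with 1 as the positive class, and tallies all four possible outcomes:

```python
# made-up labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives (Type I)
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives (Type II)
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

print(tp, fp, fn, tn)   # 3 1 1 3
```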

To measure how well our model avoids these mistakes, we use a few key metrics:

1. **Precision**: Precision helps us understand how often the model’s positive predictions are correct. It is the ratio of true positives (correct positive predictions) to all positive predictions (true positives + false positives). If we care more about avoiding Type I errors (false positives), we want high precision. For example, in spam detection, we want to avoid marking important emails as spam (false positives).

2. **Recall**: Recall tells us how well the model captures all actual positive cases. It is the ratio of true positives to all actual positives (true positives + false negatives). In cases like cancer detection, avoiding Type II errors (false negatives) is critical, so we focus on maximizing recall, ensuring we detect as many positive cases as possible.


3. **F1 Score**: The F1 score is a balance between precision and recall, providing a single measure of the model’s performance when both false positives and false negatives matter. It’s the harmonic mean of precision and recall, giving a better sense of the model’s effectiveness when we need to balance both types of errors.
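
As a worked example of the three definitions above, the sketch below computes them from made-up counts of true positives, false positives, and false negatives. The helper name `precision_recall_f1` is hypothetical, and the function assumes the denominators are not zero:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts (denominators assumed nonzero)."""
    precision = tp / (tp + fp)   # share of positive predictions that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1

# made-up counts: 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 3), round(r, 3), round(f1, 3))   # 0.8 0.667 0.727
```

Because it is a harmonic mean, the F1 score here (0.727) sits closer to the lower of the two values than a plain average (0.733) would.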


By using these metrics alongside accuracy, we ensure that our model not only makes accurate predictions but also minimizes the most harmful errors for the task at hand.

When deciding how to improve a model, we choose the metric that best fits the problem. In our case, we need to make sure most negative reviews are flagged and not mistakenly labelled as "positive". However, we also need to make sure that hard-working Delta employees do not have to sift through positive reviews mixed in with the negative ones. So we will choose our model based on the F1 score.

In [None]:
from sklearn.metrics import classification_report, accuracy_score

# create a report of all metrics of classification for logistic regression

metrics_log = classification_report(y_test,preds_log)

# print all metric results

print(metrics_log)

In [None]:
# create a report of all metrics of classification for support vector machine

metrics_svc = classification_report(y_test,preds_svc)

# print all metric results

print(metrics_svc)

In [None]:
# create a report of all metrics of classification for multinomial naive bayes

metrics_nb = classification_report(y_test,preds_nb)

# print all metric results

print(metrics_nb)

These are the scores we are interested in:


| Model                     | Accuracy | F1-Score|
|----------------------------|----------|---------------------------|
| Logistic Regression         | 83%      | 83%                       |
| Support Vector Machine      | 83%      | 83%                       |
| Multinomial Naive Bayes     | 80%      | 79%                       |


All of our models have an accuracy score that meets the minimum standard.

While Logistic Regression and Support Vector Machines show slightly higher accuracy than Multinomial Naive Bayes, the F1-Score tells a more balanced story. The F1-Score balances precision and recall, making it a better measure when both false positives and false negatives matter.

In this case, the F1-Score for Multinomial Naive Bayes is slightly lower than for the other models. This means it struggles more in correctly classifying both positive and negative reviews, compared to the other models, which perform slightly better overall.

However, since the F1-Score difference is small, the choice between models depends on priorities. If Delta prefers catching all negative reviews and is okay with a few false positives, they might still consider Multinomial Naive Bayes. But based on the balanced performance of both precision and recall, Logistic Regression or Support Vector Machines would be stronger choices for more consistent overall performance.
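
A comparison like the one above can also be computed directly rather than read off each report. The sketch below uses made-up labels and two made-up sets of predictions (`model_a` and `model_b` are placeholders, not the models trained earlier). It assumes the weighted-average F1, which is one of the averages `classification_report` prints:

```python
from sklearn.metrics import accuracy_score, f1_score

# made-up true labels and predictions from two hypothetical models
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_preds = {
    "model_a": [0, 0, 0, 1, 1, 1, 0, 0],
    "model_b": [0, 0, 0, 0, 0, 0, 1, 0],
}

scores = {}
for name, preds in toy_preds.items():
    acc = accuracy_score(y_true, preds)
    f1 = f1_score(y_true, preds, average="weighted")
    scores[name] = (acc, f1)
    print(f"{name}: accuracy={acc:.2f}, weighted F1={f1:.2f}")
```

In this toy case both models get 6 of 8 predictions right, yet `model_b` has the lower F1 because it misses most of the class 1 examples. That is the kind of difference accuracy alone can hide.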

# Finally...

The whole purpose of this classifier is to work beyond the dataset we worked with. For any new review that comes through, the classifier should correctly guess whether it is negative about 80% of the time, and flag negative reviews to Delta's customer service team.

So let's test and see if our model will work with any reviews:

In [None]:
from sklearn.pipeline import Pipeline

# create a pipeline to bring together the vectorizer and classifier and train it

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('Logistic Regression', LogisticRegression(max_iter=1000))])
pipe.fit(data['clean_tweet'], data['sentiment_numeric'])

In [None]:
# convert our score from 0 and 1 to negative or non-negative, respectively

def predict(new_tweet):
  # clean the raw text the same way the training tweets were cleaned
  cleaned = [tweet_to_words(t) for t in new_tweet]
  prediction = pipe.predict(cleaned)[0]
  if prediction == 0:
    print('Negative')
  else:
    print('Non-negative')

In [None]:
# test with real review taken from TripAdvisor

new_tweet = ['we had a terrible experience with delta.having a very long flight from Cancun to Warsaw with 2 stops, we bought tickets quite in advance and paid extra to select seats...']
predict(new_tweet)

In [None]:
# test with real review taken from TripAdvisor

new_tweet = ['Wow, what great service, pricing and travel experience my husband, and I just had, 8-23-24...']
predict(new_tweet)

And that concludes this project. You've successfully gone through all the steps below:
1. Data Loading
2. Missing Data and Duplicate Data
3. Data Distributions
4. Data Cleaning
5. Data Transformation
6. Data Vectorization
7. Choosing models
8. Training the models
9. Testing the models
10. Metrics of evaluation

and finally:

11. Testing your model with unseen data