# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file you should start with your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com), you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways), you will see this data. We can use `Python` and `BeautifulSoup` to request each paginated listing page and collect the review text it contains.

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [56]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews


In [57]:
df = pd.DataFrame()
df["reviews"] = reviews
df.head()

Unnamed: 0,reviews
0,Not Verified | At Copenhagen the most chaotic...
1,✅ Trip Verified | Worst experience of my life...
2,✅ Trip Verified | Due to code sharing with Ca...
3,✅ Trip Verified | LHR check in was quick at t...
4,✅ Trip Verified | I wouldn't recommend Britis...


In [58]:
# index=False avoids writing the row index as an extra unnamed column
df.to_csv("../DS-data/BA_reviews.csv", index=False)



Congratulations! You now have your dataset for this task. The loop above collected 1000 reviews by iterating through the paginated pages on the website. To collect more data, increase the number of pages.

The next step is to clean this data by removing unnecessary text from each row. For example, "✅ Trip Verified" can be removed from each row where it appears, as it is not relevant to what we want to investigate.
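
As a minimal sketch of that cleaning step, the snippet below strips a leading verification marker with a regular expression. The helper name `strip_verification_prefix` and the sample strings are hypothetical, and the pattern assumes the marker is either "✅ Trip Verified" or "Not Verified" followed by a `|`, as in the rows shown above.

```python
import re

# Hypothetical pattern: an optional check mark, one of the two markers seen
# in the scraped rows, then the '|' separator.
_PREFIX = re.compile(r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*")

def strip_verification_prefix(review: str) -> str:
    """Return the review text without its leading verification marker, if any."""
    return _PREFIX.sub("", review).strip()

# Made-up sample rows, for illustration only
samples = [
    "✅ Trip Verified | Example review text.",
    "Not Verified | Another example review.",
    "A review without any marker.",
]
cleaned = [strip_verification_prefix(s) for s in samples]
print(cleaned)
```

With the DataFrame built earlier, the same helper could be applied with `df["reviews"].map(strip_verification_prefix)`.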

## Load data

In [1]:
# load data
import pandas as pd 
import nltk


nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('stopwords')
df = pd.read_csv('../DS-data/BA_reviews.csv')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/ankitsingh/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/ankitsingh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ankitsingh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Preprocessing

### Resources

- <a>https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/</a>

- <a>https://amueller.github.io/aml/05-advanced-topics/13-text-data.html</a>

- <a>https://www.toptal.com/python/topic-modeling-python</a>
- <a>https://github.com/susanli2016/NLP-with-Python/blob/master/Topic%20Modeling%20for%20Data%20Preprocessing.ipynb</a>

### Steps

- Remove punctuation.
- Lowercase letters. Make all words lowercase, since the meaning of a word does not change with its capitalization, for example at the start of a sentence.

- n-grams. Consider all groups of n words in a row as new terms, called n-grams. This way, cases such as “white house” will be taken into account and added to the vocabulary list.

- Stemming. Identify prefixes and suffixes of words to isolate them from their root. This way, words like “play,” “played,” or “player” are represented by the word “play.” Stemming can be useful to reduce the number of words in the vocabulary list while preserving their meaning, but it slows preprocessing considerably because it must be applied to each word in the corpus.

- Lemmatization. Lemmatizing a document typically means “doing things correctly”: it uses a vocabulary and morphological analysis of words to remove only the inflectional endings and return the base or dictionary form of a word, known as the “lemma.” For example, you can expect a lemmatization algorithm to map “runs,” “running,” and “ran” to the lemma “run.”

- Stop words. Do not take into account groups of words lacking in meaning or utility. These include articles and prepositions but may also include words that are not useful for our specific case study, such as certain common verbs.

- Term frequency–inverse document frequency (tf–idf). Use the tf–idf coefficient instead of the raw frequency of each word in each cell of the matrix. It is the product of two numbers:

    - tf—the frequency of a given term or word in a text, and
    - idf—the logarithm of the total number of documents divided by the number of documents that contain that given term.
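
To make the tf–idf definition above concrete, here is a small pure-Python sketch. It is only an illustration under stated assumptions: tf is taken as the term's relative frequency within a document, idf uses the natural logarithm with no smoothing, and the function name `tf_idf` and the toy documents are made up for the example. Library implementations typically use slightly different variants.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # tf (assumed: relative frequency) times idf (log of N / df)
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus, made up for illustration
docs = [
    ["flight", "late", "seat"],
    ["flight", "food", "good"],
    ["seat", "comfortable", "flight"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A term that appears in every document, such as "flight" here, gets idf = log(3/3) = 0 and therefore a weight of zero, which is how tf–idf downweights words that do not help distinguish documents.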


## Analysis

### Creating the term-document matrix and cleaning data



In [5]:
# removing stop words, lemmatise the reviews
# and remove the punctuations

from nltk.corpus import stopwords
import gensim
from gensim.utils import simple_preprocess
import spacy
import json

# NLTK English stop words plus a few corpus-specific additions
stop_words = stopwords.words('english')
stop_words = stop_words + ['ba', 'organization', 'would', 'article', 'could']

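# Drop the verification marker before '|' and split each review into a list of words

def preprocess(review):
    # Keep the text after the first '|'; fall back to the whole text if there is none
    review = str(review).split('|', 1)[-1]
    review = simple_preprocess(review, deacc=True)
    return review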

# Remove stop words, form bigrams and trigrams, and lemmatize the words
# (uses the modbi and modtri phrase models defined below, before the first call)
def clean_reviews(all_words, stop_words,postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    
    # remove stop words and form bigrams and trigrams
    all_words = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in all_words]
    all_words = [modbi[doc] for doc in all_words]
    all_words = [modtri[modbi[doc]] for doc in all_words]

    # lemmatize using spaCy, keeping only tokens whose POS tag is in postags
    final_words = []
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    for sent in all_words:
        doc = nlp(" ".join(sent)) 
        final_words.append([token.lemma_ for token in doc if token.pos_ in postags])
    
    # remove stopwords again after lemmatization
    final_words = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in final_words]

    return final_words


data = df.reviews.values.tolist()
wordlist_review = [preprocess(rev) for rev in data]

# train bigram and trigram phrase detectors on the tokenized reviews

bigram = gensim.models.Phrases(wordlist_review, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[wordlist_review], threshold=100)

# freeze the detectors into lighter Phraser models for applying to documents
modbi = gensim.models.phrases.Phraser(bigram)
modtri = gensim.models.phrases.Phraser(trigram)

# Preprocess the data
prepro_data = clean_reviews(wordlist_review,stop_words)

# save the preprocessed data
with open("../DS-data/prepro.json", "w") as fp:
    json.dump(prepro_data, fp)


In [6]:
# Read preprocessed data

with open("../DS-data/prepro.json", "r") as fp:
    prepro_data = json.load(fp)



In [9]:
import gensim.corpora as corpora

# make dictionary
id2word = corpora.Dictionary(prepro_data)

# make corpus: bag-of-words list of (term id, term frequency) pairs per document
word_corpus = [id2word.doc2bow(word) for word in prepro_data]

# make LDA model with the dictionary and corpus

lda_model = gensim.models.ldamodel.LdaModel(corpus=word_corpus,
                                           id2word=id2word,
                                           num_topics=4, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=10,
                                           passes=10,
                                           alpha='symmetric',
                                           iterations=100,
                                           per_word_topics=True)


lda_model.save('../DS-data/LDA_model')


In [15]:
# Import LDA model
from gensim import  models

lda_model = models.ldamodel.LdaModel.load('../DS-data/LDA_model')

In [24]:
for row_list in lda_model[word_corpus]:
    print(row_list[0])
    break 

[(0, 0.14349383), (1, 0.5167112), (2, 0.015227525), (3, 0.32456747)]
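
Each entry in the output above is a `(topic id, probability)` pair for the first review. As a small sketch of how a dominant topic can be read off such a list, the hypothetical helper `dominant_topic` below simply picks the pair with the highest probability; the input values are copied from the output above.

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])

# Topic distribution of the first review, copied from the output above
first_review = [(0, 0.14349383), (1, 0.5167112), (2, 0.015227525), (3, 0.32456747)]

topic_id, prob = dominant_topic(first_review)
print(topic_id, round(prob, 4))  # prints: 1 0.5167
```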


In [19]:
def format_topics_sentences(ldamodel, corpus, texts):
    # Collect one row per document: dominant topic, its share, and its keywords.
    # Rows are gathered in a list because DataFrame.append is deprecated
    # (and removed in pandas 2.0).
    rows = []
    for row_list in ldamodel[corpus]:
        # With per_word_topics=True the first item is the topic distribution
        row = row_list[0] if ldamodel.per_word_topics else row_list
        # Dominant topic = the (topic, probability) pair with the highest probability
        topic_num, prop_topic = max(row, key=lambda x: x[1])
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join(word for word, prop in wp)
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add the original text as the last column
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df

df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=word_corpus, texts=prepro_data)



[(0, 0.14349449), (1, 0.5167086), (2, 0.015227525), (3, 0.32456934)]
[(0, 0.06650689), (1, 0.81923646), (2, 0.021250708), (3, 0.09300598)]
[(0, 0.030329484), (1, 0.11243532), (2, 0.41896883), (3, 0.43826637)]
[(0, 0.4931679), (1, 0.2801929), (2, 0.1443697), (3, 0.0822695)]
[(0, 0.13109584), (1, 0.8619765)]
[(1, 0.24880126), (2, 0.014480376), (3, 0.7339428)]
[(0, 0.08449418), (1, 0.014094988), (2, 0.013973923), (3, 0.88743687)]
[(0, 0.14744668), (1, 0.42015305), (2, 0.19852911), (3, 0.23387119)]
[(0, 0.09417128), (2, 0.2215012), (3, 0.67839086)]
[(0, 0.39808503), (1, 0.29305482), (2, 0.2226832), (3, 0.086176924)]
[(0, 0.7851206), (1, 0.101938985), (2, 0.069241166), (3, 0.043699216)]
[(0, 0.15860361), (1, 0.24612814), (3, 0.5869118)]
[(0, 0.31769934), (1, 0.31151056), (2, 0.08065431), (3, 0.29013583)]
[(0, 0.5508884), (1, 0.22335942), (3, 0.21872869)]
[(0, 0.43703252), (1, 0.088998646), (3, 0.46965)]
[(0, 0.28317013), (1, 0.48431158), (3, 0.22813071)]
[(0, 0.353578), (1, 0.30103993), (2,



[(0, 0.10776449), (1, 0.58035696), (2, 0.057702266), (3, 0.25417632)]
[(0, 0.38563582), (1, 0.06846422), (2, 0.22198962), (3, 0.3239103)]
[(0, 0.18653929), (1, 0.4331479), (2, 0.012032068), (3, 0.3682807)]
[(0, 0.63153565), (1, 0.01158528), (2, 0.05901209), (3, 0.297867)]
[(0, 0.32347825), (1, 0.19312945), (2, 0.09977831), (3, 0.383614)]
[(0, 0.066733636), (1, 0.6030934), (2, 0.045866176), (3, 0.2843068)]
[(0, 0.5753147), (1, 0.080291204), (2, 0.3286689), (3, 0.015725182)]
[(0, 0.29213384), (1, 0.29966757), (2, 0.32382736), (3, 0.08437122)]
[(0, 0.3722718), (1, 0.039052144), (2, 0.4646014), (3, 0.12407467)]
[(0, 0.1282625), (1, 0.16382091), (3, 0.70185906)]
[(0, 0.51387835), (2, 0.16685084), (3, 0.31046075)]
[(0, 0.69227856), (1, 0.013347143), (2, 0.280955), (3, 0.013419326)]
[(0, 0.51865137), (1, 0.048859455), (2, 0.33560288), (3, 0.09688631)]
[(0, 0.15496661), (1, 0.2417448), (2, 0.18170404), (3, 0.4215845)]
[(0, 0.1225221), (1, 0.7962384), (3, 0.07588107)]
[(0, 0.09097805), (1, 0.53

In [14]:
# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

df_dominant_topic.head(10)

| | Document_No | Dominant_Topic | Topic_Perc_Contrib | Keywords | Text |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0.5167 | plane, service, customer, say, day, ask, fligh... | [copenhagen, chaotic, ticket, counter, assignm... |
| 1 | 1 | 1 | 0.8192 | plane, service, customer, say, day, ask, fligh... | [bad, experience, life, try, deal, customer, s... |
| 2 | 2 | 3 | 0.4383 | flight, seat, get, airline, hour, fly, time, d... | [due, code, sharing, downgrade, return, leg, d... |
| 3 | 3 | 0 | 0.4932 | seat, flight, good, well, service, food, drink... | [quick, first, wing, quickly, security, first,... |
| 4 | 4 | 1 | 0.862 | plane, service, customer, say, day, ask, fligh... | [recommend, try, call, customer, service, time... |
| 5 | 5 | 3 | 0.7339 | flight, seat, get, airline, hour, fly, time, d... | [absolutely, horrible, experience, book, ticke... |
| 6 | 6 | 3 | 0.8874 | flight, seat, get, airline, hour, fly, time, d... | [bad, airline, thing, go, right, understand, g... |
| 7 | 7 | 1 | 0.4202 | plane, service, customer, say, day, ask, fligh... | [never, start, plane, hour, late, weather, rea... |
| 8 | 8 | 3 | 0.6784 | flight, seat, get, airline, hour, fly, time, d... | [bad, aircraft, ever, fly, seat, cramp, uncomf... |
| 9 | 9 | 0 | 0.3981 | seat, flight, good, well, service, food, drink... | [enjoy, flight, boarding, swift, service, frie... |
