# The Proceedure short description

Missing Data was replaced using median imputation.
The text was cleaned of stop words and numbers, then the sentiment and emotion scores was mined through a lexicon based approach (NRC Word-Emotion Association Lexicon). The results were compared to an expert sampling which proved the correlation of both approaches.
Attribute specific sentiments were mined through extracting all the through Ripple Down Rules-based Part-Of-Speech Tagging (RDRPOS) identified nouns which occurred in at least 5% of the dataset. The resulting 116 Nouns were put into “Bag of Words” created by 3 experts and 10 users of online reviews and airlines to identify the service attributes.
Each sentence containing a word in a service attribute was matched with its sentiment score to create the final dataset.


## Feature Selection
"content","cabin_flown","overall_rating","seat_comfort_rating","cabin_staff_rating","food_beverages_rating","inflight_entertainment_rating","value_money_rating","recommended"

In [15]:
import pandas as pd

df = pd.read_csv("airline_mean.csv", encoding='latin1')
df = df.iloc[:, 7:]
df = df.drop(columns=['aircraft', 'type_traveller'])

## Sentiment Analysis
With NRC Word-Emotion Association Lexicon (also called EmoLex)

In [50]:
from nrclex import NRCLex

df = df.merge(df.content.apply(lambda s: pd.Series(NRCLex(s).affect_frequencies)), 
    left_index=True, right_index=True)

KeyboardInterrupt: 

### Overall Sentiment
= positive - negative / positive + negative

In [20]:
df['overall_sentiment'] = (df.positive - df.negative)/(df.positive + df.negative)

## Bag of Words to find service attributes

In [35]:
services = {}
services['punctuality_sm'] = ['time', 'hour', 'boarding', 'minutes', 'delay', 'day' 'arrival','departure']
services['food_bev_sm'] = 'Food, meal, drinks, water, breakfast'.lower().split(', ')
services['comfort_sm'] = 'Seat, cabin, leg, lounge, room, legroom, seating, row, landing, sleep, space'.lower().split(', ')
services['staff_bh_sm'] = 'Crew, staff, attendants, people, ground'.lower().split(', ')
services['inflight_ent_sm'] = 'Entertainment, experience, inflight, movies'.lower().split(', ')
services['checkin_sm'] = 'Check in, luggage, board, baggage'.lower().split(', ')

### Feature Engineering
We go through each review, split it into its composing sentences and see if it contains words from the services defined previously. We combine the sentences by averaging the sentiment scores for each category and in the end fill the missing values with the mean.

In [51]:
def extract_sentiment_service(text):
    service_sentiments = service_osm_list(text)
    sm = pd.DataFrame(service_sentiments)
    answer = sm.mean()
    return answer
    
def service_osm_list(text):
    sentences = NRCLex(text).sentences
    sentiments = []
    for sentence in sentences:
        sentence = str(sentence)
        result = {}
        osm = overall_sentiment_in_text(sentence)
        srvc = services_in_sentence(sentence)
        for service in srvc:
            result[service] = osm
        sentiments.append(result)
    return sentiments

def overall_sentiment_in_text(text):
    sentiment = NRCLex(text).affect_frequencies
    try:
        overall_sentiment = (sentiment['positive'] - sentiment['negative'])/(sentiment['positive'] + sentiment['negative'])
        return overall_sentiment
    except ZeroDivisionError:
        return 0

def services_in_sentence(sentence):
    words = sentence.lower().split()
    srv = set()
    for key, value in services.items():
        if any(item in words for item in value):
            srv.add(key)
    return srv

df_final = df.merge(df.content.apply(extract_sentiment_service), 
    left_index=True, right_index=True)

# Missing Fields were filled with the mean
df_final = df_final.fillna(df_final.mean())
df_final.head(10)

Unnamed: 0,content,cabin_flown,overall_rating,seat_comfort_rating,cabin_staff_rating,food_beverages_rating,inflight_entertainment_rating,value_money_rating,recommended,fear,...,sadness,disgust,joy,overall_sentiment,food_bev_sm,punctuality_sm,comfort_sm,staff_bh_sm,checkin_sm,inflight_ent_sm
0,Outbound flight FRA/PRN A319. 2 hours 10 min f...,Economy,7,4,4,4,2,4,1,0.101604,...,0.101604,0.096257,0.090909,-0.027027,0.0,0.0,0.0,0.067703,0.031233,0.09955
1,Two short hops ZRH-LJU and LJU-VIE. Very fast ...,Business Class,10,4,5,4,1,5,1,0.102564,...,0.089744,0.089744,0.102564,0.058824,0.107043,0.020541,0.0,0.1,0.031233,0.09955
2,Flew Zurich-Ljubljana on JP365 newish CRJ900. ...,Economy,9,5,5,4,2,5,1,0.090226,...,0.105263,0.090226,0.097744,0.030303,0.090909,0.0,0.048736,0.067703,0.0,0.09955
3,Adria serves this 100 min flight from Ljubljan...,Business Class,8,4,4,3,1,4,1,0.093023,...,0.087209,0.087209,0.093023,0.121951,0.0,0.020541,0.058824,0.067703,0.031233,0.09955
4,WAW-SKJ Economy. No free snacks or drinks on t...,Economy,4,4,2,1,2,2,0,0.098266,...,0.095376,0.089595,0.095376,0.051282,0.033333,0.020541,0.048736,0.067703,0.031233,0.09955
5,Sarajevo-Frankfurt via Ljubljana. I loved flyi...,Economy,9,4,4,3,3,4,1,0.097046,...,0.092827,0.092827,0.101266,0.0,0.107043,0.020541,0.048736,0.090909,0.090909,0.090909
6,I had flights from Paris to Sarajevo via Ljubl...,Economy,5,4,4,1,2,3,1,0.097701,...,0.097701,0.086207,0.097701,0.0,0.25,0.020541,0.048736,0.067703,0.031233,0.09955
7,LJU to FRA and back both flights were on time....,Economy,9,5,5,4,3,4,1,0.086957,...,0.097826,0.086957,0.097826,0.142857,0.107043,0.020541,0.2,0.2,0.031233,0.09955
8,On my Ljubljana - Munich flight in business cl...,Business Class,8,4,3,4,1,4,1,0.103448,...,0.097179,0.094044,0.103448,-0.014925,0.0,0.020541,0.048736,0.067703,0.031233,0.09955
9,Flights from LJU to ZRH and back all on time. ...,Economy,10,5,5,4,4,4,1,0.084615,...,0.092308,0.084615,0.1,0.16129,0.066667,0.066667,0.266667,0.333333,0.031233,0.09955


In [49]:
df_final.drop(columns=['content']).to_csv('final_data.csv',index=False)