# The Procedure: Short Description

Missing data was replaced using median imputation.
The text was cleaned of stop words and numbers, then sentiment and emotion scores were mined with a lexicon-based approach (NRC Word-Emotion Association Lexicon). The results were compared to an expert sample, which confirmed that the two approaches correlate.
Attribute-specific sentiments were mined by extracting all nouns, identified through Ripple Down Rules-based Part-Of-Speech Tagging (RDRPOS), that occurred in at least 5% of the dataset. The resulting 116 nouns were sorted into “bags of words” by 3 experts and 10 users of online reviews and airlines to identify the service attributes.
Each sentence containing a word from a service attribute was matched with its sentiment score to create the final dataset.
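
The 5% cut-off for candidate nouns can be sketched as a document-frequency filter. This assumes the nouns of each review have already been extracted by a POS tagger; the `nouns_per_review` data and the helper `frequent_nouns` are hypothetical and illustrate only the filtering step, not the RDRPOS tagging itself.

```python
from collections import Counter

def frequent_nouns(nouns_per_review, min_share=0.05):
    """Return the nouns that occur in at least `min_share` of the reviews."""
    n_reviews = len(nouns_per_review)
    if n_reviews == 0:
        return []
    # Count each noun once per review (document frequency, not term frequency).
    doc_freq = Counter(noun for nouns in nouns_per_review for noun in set(nouns))
    return sorted(noun for noun, count in doc_freq.items()
                  if count / n_reviews >= min_share)

# Hypothetical toy input: one list of tagged nouns per review.
nouns_per_review = [
    ["seat", "crew", "seat"],
    ["seat", "food"],
    ["crew", "delay"],
    ["seat"],
]
print(frequent_nouns(nouns_per_review, min_share=0.5))
```

Counting a noun once per review keeps a single long review from pushing a rare word over the threshold.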


## Feature Selection
"content","cabin_flown","overall_rating","seat_comfort_rating","cabin_staff_rating","food_beverages_rating","inflight_entertainment_rating","value_money_rating","recommended"

In [15]:
import pandas as pd

df = pd.read_csv("airline_mean.csv", encoding='latin1')
df = df.iloc[:, 7:]
df = df.drop(columns=['aircraft', 'type_traveller'])
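
Selecting by position with `iloc[:, 7:]` depends on the column order of the CSV. A sketch that selects the listed features by name instead is shown here, assuming the file's headers match the names above; the inline CSV is a hypothetical stand-in for `airline_mean.csv`.

```python
import io
import pandas as pd

FEATURES = [
    "content", "cabin_flown", "overall_rating", "seat_comfort_rating",
    "cabin_staff_rating", "food_beverages_rating",
    "inflight_entertainment_rating", "value_money_rating", "recommended",
]

# Hypothetical two-row stand-in for the real file, with one extra column.
sample_csv = io.StringIO(
    "airline_name,content,cabin_flown,overall_rating,seat_comfort_rating,"
    "cabin_staff_rating,food_beverages_rating,inflight_entertainment_rating,"
    "value_money_rating,recommended\n"
    "Example Air,Good flight.,Economy,7,4,4,3,2,4,1\n"
    "Example Air,Late departure.,Economy,3,2,3,2,1,2,0\n"
)

# usecols keeps only the named columns, regardless of their position;
# the trailing [FEATURES] fixes the column order.
df_features = pd.read_csv(sample_csv, usecols=FEATURES)[FEATURES]
print(df_features.columns.tolist())
```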

## Sentiment Analysis
Using the NRC Word-Emotion Association Lexicon (also called EmoLex).

In [19]:
from nrclex import NRCLex

df = df.merge(df.content.apply(lambda s: pd.Series(NRCLex(s).affect_frequencies)), 
    left_index=True, right_index=True)

                                                 content     cabin_flown  \
0      Outbound flight FRA/PRN A319. 2 hours 10 min f...         Economy   
1      Two short hops ZRH-LJU and LJU-VIE. Very fast ...  Business Class   
2      Flew Zurich-Ljubljana on JP365 newish CRJ900. ...         Economy   
3      Adria serves this 100 min flight from Ljubljan...  Business Class   
4      WAW-SKJ Economy. No free snacks or drinks on t...         Economy   
...                                                  ...             ...   
41391  This airline is terrible! Timetable changes (m...         Economy   
41392  We often fly with Wizzair to/from Charleroi/Bu...         Economy   
41393  Avoid Wizzair! A group of us had our outgoing ...         Economy   
41394  PRG-LTN and LTN-PRG were rather good flights. ...         Economy   
41395  London - Kiev. First problem started a few wee...         Economy   

       overall_rating  seat_comfort_rating  cabin_staff_rating  \
0                   7

# Return words list.
text_object.words

# Return sentences list.
text_object.sentences

# Return affect list.
text_object.affect_list

# Return affect dictionary.
text_object.affect_dict

# Return raw emotional counts.
text_object.raw_emotion_scores

# Return highest emotions.
text_object.top_emotions

# Return affect frequencies.
text_object.affect_frequencies
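
The merge step relies on `apply` returning a `pd.Series` per row, which pandas expands into one column per dictionary key. The sketch below shows that mechanism with a hypothetical `toy_affect` scorer and a two-category `TOY_LEXICON` standing in for `NRCLex(s).affect_frequencies`, so it runs without the lexicon installed.

```python
import pandas as pd

# Hypothetical two-category lexicon, standing in for EmoLex.
TOY_LEXICON = {"good": "positive", "friendly": "positive", "late": "negative"}

def toy_affect(text):
    """Return the share of lexicon hits per category as a frequency dict."""
    hits = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    freqs = {"positive": 0.0, "negative": 0.0}
    for category in hits:
        freqs[category] += 1 / len(hits)
    return freqs

reviews = pd.DataFrame(
    {"content": ["good friendly crew", "late but good", "no comment"]}
)
# Each returned Series becomes one row; its keys become the new columns.
scores = reviews["content"].apply(lambda s: pd.Series(toy_affect(s)))
reviews = reviews.join(scores)
print(reviews)
```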

### Overall Sentiment
= (positive - negative) / (positive + negative)

In [20]:
df['overall_sentiment'] = (df.positive - df.negative)/(df.positive + df.negative)
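
When a review has neither positive nor negative hits, the denominator is zero and pandas yields `NaN` instead of raising an error. The sketch below uses made-up frequencies and fills those cases with 0, mirroring what `overall_sentiment_in_text` does per sentence; treating 0 as the neutral value is an assumption.

```python
import pandas as pd

# Made-up frequencies; the last row has no positive or negative hits.
toy = pd.DataFrame({"positive": [0.75, 0.25, 0.0],
                    "negative": [0.25, 0.75, 0.0]})
toy["overall_sentiment"] = (
    (toy["positive"] - toy["negative"]) / (toy["positive"] + toy["negative"])
)
# 0/0 produces NaN for the last row.
nan_mask = toy["overall_sentiment"].isna().tolist()
print(nan_mask)

# Assumed convention: no sentiment words means a neutral score of 0.
toy["overall_sentiment"] = toy["overall_sentiment"].fillna(0.0)
print(toy)
```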

## Bag of Words to find service attributes

In [35]:
services = {}
services['punctuality_sm'] = ['time', 'hour', 'boarding', 'minutes', 'delay', 'day', 'arrival', 'departure']
services['food_bev_sm'] = 'Food, meal, drinks, water, breakfast'.lower().split(', ')
services['comfort_sm'] = 'Seat, cabin, leg, lounge, room, legroom, seating, row, landing, sleep, space'.lower().split(', ')
services['staff_bh_sm'] = 'Crew, staff, attendants, people, ground'.lower().split(', ')
services['inflight_ent_sm'] = 'Entertainment, experience, inflight, movies'.lower().split(', ')
services['checkin_sm'] = 'Check in, luggage, board, baggage'.lower().split(', ')

### Feature Engineering
We go through each review, split it into its constituent sentences, and check whether each sentence contains words from the services defined previously. For each category we average the sentiment scores of the matching sentences, and finally fill the missing values with the column mean.
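
The averaging step can be illustrated on hand-written per-sentence results. The scores below are made up; the point is that `pd.DataFrame` aligns the dictionaries by key and `mean()` skips the sentences in which a category was not mentioned.

```python
import pandas as pd

# Made-up per-sentence results: one dict per sentence, keyed by service attribute.
per_sentence = [
    {"food_bev_sm": 1.0},
    {"food_bev_sm": 0.0, "staff_bh_sm": -1.0},
    {},  # sentence without any service keyword
]
# Missing keys become NaN, which mean() ignores by default.
review_scores = pd.DataFrame(per_sentence).mean()
print(review_scores)
```

A category that never appears in a review simply has no entry in the result, which is why missing values remain to be filled afterwards.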

In [47]:
def extract_sentiment_service(text):
    """Return the mean sentiment per service attribute for one review."""
    service_sentiments = service_osm_list(text)
    # One row per sentence, one column per service attribute found.
    sm = pd.DataFrame(service_sentiments)
    return sm.mean()

def service_osm_list(text):
    """Return one {service: sentiment} dict per sentence of the review."""
    sentences = NRCLex(text).sentences
    sentiments = []
    for sentence in sentences:
        sentence = str(sentence)
        osm = overall_sentiment_in_text(sentence)
        # Assign the sentence-level score to every service it mentions.
        result = {service: osm for service in services_in_sentence(sentence)}
        sentiments.append(result)
    return sentiments

def overall_sentiment_in_text(text):
    sentiment = NRCLex(text).affect_frequencies
    try:
        overall_sentiment = (sentiment['positive'] - sentiment['negative'])/(sentiment['positive'] + sentiment['negative'])
        return overall_sentiment
    except ZeroDivisionError:
        return 0

def services_in_sentence(sentence):
    words = sentence.lower().split()
    srv = set()
    for key, value in services.items():
        if any(item in words for item in value):
            srv.add(key)
    return srv

df_final = df.merge(df.content.apply(extract_sentiment_service), 
    left_index=True, right_index=True)

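# Fill missing fields with the column mean (numeric columns only;
# without numeric_only=True, recent pandas raises on the text columns)
df_final = df_final.fillna(df_final.mean(numeric_only=True))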
print(df_final)

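
One caveat of matching on `sentence.lower().split()` is that tokens keep their punctuation (`food.` does not equal `food`), and a two-word keyword such as `check in` can never equal a single token. A possible variant, offered as a sketch rather than as the method used in this notebook, matches keywords as whole words or phrases with a regular expression. The names `services_in_sentence_re`, `toy_services` and `patterns` are illustrative only.

```python
import re

# Shortened, illustrative copy of the service dictionary.
toy_services = {
    "food_bev_sm": ["food", "meal", "drinks"],
    "checkin_sm": ["check in", "luggage", "baggage"],
}

# One compiled pattern per attribute; \b keeps "meal" from matching "mealtime".
patterns = {
    key: re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b")
    for key, keywords in toy_services.items()
}

def services_in_sentence_re(sentence):
    """Return the set of service attributes whose keywords occur in the sentence."""
    text = sentence.lower()
    return {key for key, pattern in patterns.items() if pattern.search(text)}

print(services_in_sentence_re("Check in was slow, but the food."))
```

Switching to this matcher would change which sentences are assigned to each attribute, so the resulting scores would not be directly comparable with those produced by the whitespace-based version.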

In [49]:
df_final.drop(columns=['content']).to_csv('final_data.csv',index=False)