
In [62]:
import numpy as np
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [71]:
df = pd.read_csv('/content/drive/MyDrive/airline_review.csv')

In [72]:
df.head(5)

Unnamed: 0,airline,overall,author,review_date,customer_review,aircraft,traveller_type,cabin,route,date_flown,seat_comfort,cabin_service,food_bev,entertainment,ground_service,value_for_money,recommended
0,Adria Airways,8,M Jager,12th October 2018,âœ… Trip Verified | Ljubljana to Munich. The h...,,Family Leisure,Economy Class,Ljubljana to Munich,October 2018,4,4,3,0,5,5,yes
1,Adria Airways,1,Giulia Rossi,5th October 2018,Not Verified | Zurich to Ljubljana. Very poor ...,,Business,Economy Class,Zurich to Ljubljana,October 2018,2,1,0,1,1,1,no
2,Adria Airways,1,Galya Slavov,29th July 2018,âœ… Trip Verified | Vienna to Sofia. The fligh...,,Family Leisure,Economy Class,Vienna to Sofia,July 2018,4,1,1,0,4,1,no
3,Adria Airways,2,Loic Jouan,19th July 2018,âœ… Trip Verified | We were traveling from Par...,,Solo Leisure,Economy Class,Paris to Skopje via Ljubljana,May-18,3,3,0,0,3,2,no
4,Adria Airways,2,P Gamirj,30th June 2018,âœ… Trip Verified | Ljubljana to Munich. Adria...,,Business,Economy Class,Ljubljana to Munich,June 2018,1,2,2,0,2,1,no


In [73]:
df.shape

(61183, 17)

In [74]:
# Count missing values in the review column
customer_review = 'customer_review'

num_empty_cells = df[customer_review].isnull().sum()

print(f"Number of empty cells in '{customer_review}': {num_empty_cells}")

Number of empty cells in 'customer_review': 0


In [75]:
# Keep only rows with a non-empty review; copy so later column assignments do not warn
data = df[df['customer_review'].apply(lambda x: len(x) > 0)].copy()

In [76]:
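# Strip punctuation from the reviews in df (raw string avoids invalid-escape warnings).
# Note: `data` was built from df in the previous cell and is a separate DataFrame, so it keeps the original text.
df['customer_review'] = df['customer_review'].apply(lambda x: re.sub(r'[^\w\s]', '', x))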
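
As a quick check of what the pattern above does, the sketch below applies it to a short string modeled on the review text shown earlier. The sample string and the `strip_punctuation` helper name are illustrative assumptions, not part of the notebook.

```python
import re


def strip_punctuation(text: str) -> str:
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)


# Hypothetical sample resembling the start of a review
sample = "Trip Verified | Ljubljana to Munich. Very poor!"
cleaned = strip_punctuation(sample)
print(cleaned)

# Removing "|" leaves a double space; this collapses runs of whitespace
collapsed = " ".join(cleaned.split())
print(collapsed)
```

Whether to collapse the leftover whitespace is a design choice; `TfidfVectorizer` tokenizes on word boundaries, so extra spaces should not change the resulting vocabulary.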

In [77]:
vectorizer = TfidfVectorizer(stop_words='english')

In [78]:
# Create an LDA model
lda_model = LatentDirichletAllocation(n_components=10, random_state=0)

In [79]:
# Fit the model to the data
topic_distributions = lda_model.fit_transform(vectorizer.fit_transform(data['customer_review']))
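
The call above returns a matrix with one row per review and one column per topic. A minimal sketch on a toy corpus, assuming four made-up reviews and two topics (all `toy_*` names are illustrative and not part of the notebook), shows the shape of that matrix and that each row is normalized:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in reviews; the notebook uses data['customer_review'].
toy_reviews = [
    "excellent crew and comfortable seat",
    "flight delayed and luggage lost",
    "friendly staff and good food",
    "worst airline rude staff long delay",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_reviews)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_topics = toy_lda.fit_transform(toy_matrix)

print(toy_topics.shape)        # (number of documents, number of topics)
print(toy_topics.sum(axis=1))  # each row sums to 1 (up to floating-point error)
```

Because each row sums to 1, a threshold such as 0.5 on a single column asks whether that topic holds more than half of the document's topic weight.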

In [80]:
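# Heuristic sentiment labeler based on the weight of topic 0
class SentimentAnalysisModel:
    """Label a document "positive" when topic 0 holds more than half of its topic weight."""

    def predict(self, topic_distribution):
        # topic_distribution is one row of the document-topic matrix
        if topic_distribution[0] > 0.5:
            return "positive"
        else:
            return "negative"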

In [81]:
# Create an instance of the SentimentAnalysisModel
sentiment_analysis_model = SentimentAnalysisModel()

In [82]:
# Get the sentiment for each document
sentiments = [sentiment_analysis_model.predict(topic_distribution) for topic_distribution in topic_distributions]
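
The same rule can also be applied to the whole matrix at once. The sketch below is one possible vectorized equivalent, assuming a small hand-written document-topic matrix (`toy_topic_distributions` is hypothetical); it relies on the notebook's own assumption that topic 0 stands for positive sentiment.

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents, 2 topics, rows sum to 1.
toy_topic_distributions = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
])

# Same rule as SentimentAnalysisModel.predict: topic 0 weight strictly above 0.5.
toy_sentiments = np.where(toy_topic_distributions[:, 0] > 0.5, "positive", "negative")
print(toy_sentiments.tolist())
```

Note that a weight of exactly 0.5 is labeled "negative", since the comparison is strict.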

In [83]:
# Add the 'sentiment' column to the DataFrame
data['sentiment'] = sentiments

In [84]:
# Display the DataFrame with the 'sentiment' column
print(data.head())

         airline  overall        author        review_date  \
0  Adria Airways        8       M Jager  12th October 2018   
1  Adria Airways        1  Giulia Rossi   5th October 2018   
2  Adria Airways        1  Galya Slavov     29th July 2018   
3  Adria Airways        2    Loic Jouan     19th July 2018   
4  Adria Airways        2      P Gamirj     30th June 2018   

                                     customer_review aircraft  traveller_type  \
0  âœ… Trip Verified | Ljubljana to Munich. The h...      NaN  Family Leisure   
1  Not Verified | Zurich to Ljubljana. Very poor ...      NaN        Business   
2  âœ… Trip Verified | Vienna to Sofia. The fligh...      NaN  Family Leisure   
3  âœ… Trip Verified | We were traveling from Par...      NaN    Solo Leisure   
4  âœ… Trip Verified | Ljubljana to Munich. Adria...      NaN        Business   

           cabin                          route    date_flown  seat_comfort  \
0  Economy Class            Ljubljana to Munich  October 2018

Next, extract the top words associated with positive ("happy") sentiment from each topic.


In [94]:
# Fit the vectorizer and transform the data
tfidf_matrix = vectorizer.fit_transform(data['customer_review'])

In [95]:
# Create the LDA model
lda_model = LatentDirichletAllocation(n_components=10, random_state=0)

In [96]:
# Fit the LDA model to the TF-IDF data
topic_distributions = lda_model.fit_transform(tfidf_matrix)

In [97]:
# Get up to N words per topic, filtered by the "happy" (positive) label
def get_top_happy_words(model, feature_names, n_words):
    topic_top_happy_words = []
    for topic in model.components_:
        # Word indices sorted by descending topic-word weight
        top_word_indices = topic.argsort()[::-1]
        # Keep word i when row i of topic_distributions is labeled "positive".
        # Note: i is a vocabulary index, while topic_distributions has one row per review.
        top_happy_words = [feature_names[i] for i in top_word_indices if sentiment_analysis_model.predict(topic_distributions[i]) == "positive"][:n_words]
        topic_top_happy_words.append(top_happy_words)
    return topic_top_happy_words


In [98]:
# Get the feature names (i.e., words) from the vectorizer
feature_names = np.array(vectorizer.get_feature_names_out())

# Number of top words to display for each topic
num_top_words = 10

In [99]:
# Initialize the Sentiment Analysis Model
sentiment_analysis_model = SentimentAnalysisModel()

In [100]:
# Get the top N words associated with "happy" sentiment for each topic
top_happy_words_per_topic = get_top_happy_words(lda_model, feature_names, num_top_words)

In [101]:
# Display the top happy words for each topic
for topic_idx, top_words in enumerate(top_happy_words_per_topic):
    print(f"Topic {topic_idx}: {', '.join(top_words)}")

Topic 0: economy, excellent, meal, airways, time, staff, emirates, kong, aircraft, london
Topic 1: llw, lemak, la, justfly, miri, mi, xna, takeover, experiencia, wilkes
Topic 2: airline, time, airport, ticket, worst, said, las, charge, experience, minutes
Topic 3: air, westjet, cramped, nassau, las, kilimanjaro, victoria, yul, maui, kingston
Topic 4: time, staff, airline, low, baggage, price, aircraft, airport, early, minutes
Topic 5: airport, staff, airline, time, air, minutes, said, late, experience, worst
Topic 6: paolo, ko, wars, nanchang, sukhothai, ketchikan, eaters, laughs, w6, degradation
Topic 7: minh, et, sanya, sarajevo, les, shreveport, krasnodar, pas, dã, mfr
Topic 8: time, staff, economy, excellent, aircraft, drinks, meal, experience, attendants, air
Topic 9: mombasa, svg, dresden, sxm, mke, intelligent, di, disorderly, albury, ayers
