### Step 1: Load and Explore the Dataset

I'll begin by loading the dataset into a pandas DataFrame and taking a quick look at the first few rows.


In [1]:
import pandas as pd

# Load the dataset
reviews_df = pd.read_csv('airlines_reviews.csv')

# Display the first few rows of the dataset
reviews_df.head()


Unnamed: 0,review
0,The service was excellent. The cabin staff we...
1,We have had some torrid experiences with BA -...
2,We had a flight from ZRH to SFO via LHR. The l...
3,London to Paris. I wish that they would updat...
4,JFK to LHR. Empty check in and priority securi...


### Step 2: Information Retrieval
We'll aim to extract the following information from each review:
1. **Sentiment Analysis**: We'll gauge the sentiment of each review. This can be broadly categorized as positive, negative, or neutral.
2. **Category of Complaints**: By using keyword extraction or topic modeling, we can identify common categories of complaints or feedback.
3. **Urgency**: While determining urgency directly from the reviews might be challenging, we can infer urgency by looking for specific keywords or phrases that indicate immediate issues or concerns.
4. **Risks**: We can identify potential risks by searching for keywords that might indicate safety or health concerns.
5. **Other Information**: Depending on the reviews' content, we might identify other patterns or keywords that can provide additional insights.

### Step 3: Define a Pydantic Model
We'll define a Pydantic model to structure the information we retrieve from the reviews. This will ensure that the extracted data is validated and serialized in a consistent manner.

In [2]:
from pydantic import BaseModel
from typing import Optional, List

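class ReviewInfo(BaseModel):
    sentiment: str  # "positive", "negative", or "neutral"
    categories: List[str]  # topics mentioned in the review
    urgency: Optional[str] = None  # "high", "medium", or "low", if evident
    risks: Optional[List[str]] = None  # e.g. safety or health concerns
    other_info: Optional[str] = None  # any other notable information

# Inspect the generated JSON schema (Pydantic v1 API; v2 uses model_json_schema())
ReviewInfo.schema()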


{'title': 'ReviewInfo',
 'type': 'object',
 'properties': {'sentiment': {'title': 'Sentiment', 'type': 'string'},
  'categories': {'title': 'Categories',
   'type': 'array',
   'items': {'type': 'string'}},
  'urgency': {'title': 'Urgency', 'type': 'string'},
  'risks': {'title': 'Risks', 'type': 'array', 'items': {'type': 'string'}},
  'other_info': {'title': 'Other Info', 'type': 'string'}},
 'required': ['sentiment', 'categories']}

The `ReviewInfo` Pydantic model is defined as follows:

- **sentiment**: This represents the sentiment of the review and can take values like "positive", "negative", or "neutral".
- **categories**: This is a list capturing the various categories or topics mentioned in the review. For instance, categories could include "in-flight service", "baggage handling", "ticketing", etc.
- **urgency**: This field indicates the urgency of the feedback, with possible values being "high", "medium", or "low". It's optional since not all reviews might have an evident urgency.
- **risks**: This is a list of potential risks identified in the review, such as "safety concern" or "health issue". This field is also optional.
- **other_info**: Any other significant information extracted from the review will be stored here. This is an optional field.
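
To make the field descriptions above concrete, here is a minimal sketch of building and validating a record. `StrictReviewInfo` is a hypothetical variant, not part of this notebook, that uses `typing.Literal` to restrict `sentiment` and `urgency` to the values listed above; the sample values are made up for illustration.

```python
from typing import List, Literal, Optional

from pydantic import BaseModel, ValidationError


class StrictReviewInfo(BaseModel):
    """Hypothetical stricter variant of ReviewInfo (an assumption, not the notebook's model)."""

    sentiment: Literal["positive", "negative", "neutral"]
    categories: List[str]
    urgency: Optional[Literal["high", "medium", "low"]] = None
    risks: Optional[List[str]] = None
    other_info: Optional[str] = None


# A well-formed record: the optional fields fall back to None.
record = StrictReviewInfo(sentiment="positive", categories=["seating"])
print(record.urgency)

# An unexpected label is rejected at construction time.
try:
    StrictReviewInfo(sentiment="mixed", categories=[])
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

Restricting the labels this way is a design choice: it catches typos such as "postive" early, at the cost of having to update the model whenever a new label is introduced.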

### Step 4: Sentiment Analysis

We'll use a simple approach to gauge sentiment, based on TextBlob's polarity score, which ranges from -1 (most negative) to 1 (most positive):
- If the polarity is greater than 0, we'll label the review as "positive".
- If the polarity is less than 0, we'll label it as "negative".
- If the polarity is exactly 0, we'll label it as "neutral".

Let's generate sentiment labels for the reviews.

In [4]:
from textblob import TextBlob

def get_sentiment(text):
    """Determine the sentiment of the text using TextBlob."""
    # Polarity is a float in [-1.0, 1.0]; compute it once
    polarity = TextBlob(text).sentiment.polarity
    # Classifying sentiment by the sign of the polarity
    if polarity > 0:
        return "positive"
    elif polarity < 0:
        return "negative"
    else:
        return "neutral"

# Applying the sentiment function to the reviews
reviews_df["sentiment"] = reviews_df["review"].apply(get_sentiment)

# Displaying the distribution of sentiments
reviews_df["sentiment"].value_counts()


positive    365
negative    134
neutral       1
Name: sentiment, dtype: int64

From the sentiment analysis:

- There are 365 positive reviews.
- There are 134 negative reviews.
- There is 1 neutral review.
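
Only one review landed on a polarity of exactly zero, so the "neutral" label is almost never used under this rule. One possible refinement, sketched here as an assumption rather than something this notebook does, is to treat a small band around zero as neutral. The function name `label_polarity` and the `0.05` threshold are illustrative choices, not tuned values.

```python
def label_polarity(polarity: float, neutral_band: float = 0.05) -> str:
    """Map a polarity score in [-1, 1] to a label, with a neutral band around 0."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


# Hand-picked scores for illustration (not taken from the dataset).
for score in (0.40, 0.02, -0.01, -0.30):
    print(score, label_polarity(score))
```

With TextBlob, this could be applied as `label_polarity(TextBlob(text).sentiment.polarity)`. Setting `neutral_band=0` reproduces the sign-based rule used in this notebook.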

Next, we'll focus on identifying the categories of complaints or feedback within the reviews. For this, we'll utilize keyword extraction to pinpoint common themes or topics. 

### Step 5: Extract Categories

To start, we'll identify a set of potential keywords or phrases that represent common categories of feedback in the airline industry. We'll then search for these keywords within the reviews to categorize them.



In [5]:
# Define a list of potential categories and associated keywords
categories_keywords = {
    "in-flight service": ["cabin", "service", "in-flight", "meal", "food", "beverage"],
    "baggage handling": ["baggage", "luggage", "lost", "damaged", "delayed"],
    "ticketing": ["ticket", "booking", "reservation", "price", "cost", "charge"],
    "seating": ["seat", "legroom", "space", "comfort", "recline", "position"],
    "staff behavior": ["staff", "crew", "attendant", "rude", "friendly", "helpful"],
    "delays": ["delay", "late", "wait", "time", "hour", "postponed"],
    "entertainment": ["entertainment", "screen", "movie", "audio", "headphone"],
    "safety": ["safety", "emergency", "seatbelt", "landing", "takeoff"],
    "cleanliness": ["clean", "dirty", "maintained", "hygiene", "sanitary"]
}

def extract_categories(text):
    """Extract categories from the text based on the presence of specific keywords."""
    lowered = text.lower()  # lowercase once instead of once per keyword
    matched_categories = []
    for category, keywords in categories_keywords.items():
        # Substring match: "seat" also matches "seats" and "seatbelt"
        if any(keyword in lowered for keyword in keywords):
            matched_categories.append(category)
    return matched_categories

# Extract categories for each review
reviews_df["categories"] = reviews_df["review"].apply(extract_categories)

# Displaying some sample reviews with extracted categories
reviews_df[["review", "categories"]].head()


Unnamed: 0,review,categories
0,The service was excellent. The cabin staff we...,"[in-flight service, seating, staff behavior]"
1,We have had some torrid experiences with BA -...,"[in-flight service, baggage handling, seating,..."
2,We had a flight from ZRH to SFO via LHR. The l...,"[in-flight service, staff behavior, entertainm..."
3,London to Paris. I wish that they would updat...,"[ticketing, seating]"
4,JFK to LHR. Empty check in and priority securi...,"[in-flight service, staff behavior, delays]"


We have successfully extracted the categories from the reviews based on the presence of specific keywords. For example:

- The first review mentions the "service," "cabin staff," and "seating," so it's categorized under "in-flight service," "seating," and "staff behavior."
- The second review touches on multiple aspects like "in-flight service," "baggage handling," "seating," and "staff behavior."
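
Because the matching here is a plain substring test, a keyword such as "late" also matches inside "plate", and "time" matches inside "sometimes". A hedged alternative, not used in this notebook, is to require each keyword to start at a word boundary. The helper name `extract_categories_wb` and the trimmed keyword map below are assumptions for illustration; adopting this approach would change the categories shown above.

```python
import re

# Trimmed, illustrative keyword map (the notebook's full map has nine categories).
sample_keywords = {
    "delays": ["delay", "late", "wait"],
    "seating": ["seat", "legroom"],
}

# One compiled pattern per category. The leading \b requires the keyword to
# begin at a word boundary; the trailing \w* still allows inflected forms
# such as "delayed" or "seats".
patterns = {
    category: re.compile(
        r"\b(?:" + "|".join(map(re.escape, words)) + r")\w*", re.IGNORECASE
    )
    for category, words in sample_keywords.items()
}


def extract_categories_wb(text: str) -> list:
    """Return the categories that have a keyword starting a word in the text."""
    return [category for category, pattern in patterns.items() if pattern.search(text)]


# Made-up sentences for illustration (not taken from the dataset).
print(extract_categories_wb("The plate was cold but the seats were fine."))
print(extract_categories_wb("We were delayed for two hours."))
```

In the first sentence "plate" no longer triggers the "delays" category, while "seats" still maps to "seating".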

Next, let's focus on determining the **urgency** of the feedback. We'll attempt to infer urgency by looking for specific keywords or phrases that indicate immediate issues or concerns.

### Step 6: Determine Urgency

We'll classify urgency into three levels: "high," "medium," and "low." We'll define a set of keywords for each level and then categorize the reviews based on these keywords.

In [6]:
# Define keywords for different levels of urgency
urgency_keywords = {
    "high": ["urgent", "immediate", "asap", "right away", "critical", "emergency"],
    "medium": ["soon", "shortly", "in a while", "important", "necessary"],
    "low": ["whenever", "later", "in the future", "eventually"]
}

def determine_urgency(text):
    """Determine the urgency level of the text based on the presence of specific keywords."""
    lowered = text.lower()  # lowercase once instead of once per keyword
    # Levels are checked in dict order (high, medium, low); the first match wins
    for urgency_level, keywords in urgency_keywords.items():
        if any(keyword in lowered for keyword in keywords):
            return urgency_level
    return None  # If no keywords are found, return None

# Determine urgency for each review
reviews_df["urgency"] = reviews_df["review"].apply(determine_urgency)

# Displaying some sample reviews with determined urgency
reviews_df[["review", "urgency"]].head()


Unnamed: 0,review,urgency
0,The service was excellent. The cabin staff we...,
1,We have had some torrid experiences with BA -...,
2,We had a flight from ZRH to SFO via LHR. The l...,
3,London to Paris. I wish that they would updat...,
4,JFK to LHR. Empty check in and priority securi...,


In [8]:
reviews_df['urgency'].value_counts()

low       51
medium    28
high       9
Name: urgency, dtype: int64
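
Since these labels come from keyword hits alone, it can help to see which keyword triggered each label before trusting the counts; a word like "later", for instance, may simply be part of a narrative. The sketch below is an illustrative audit helper, not part of the notebook: `explain_urgency` is a hypothetical name, and the two sentences are made up.

```python
# Same keyword map as in the cell above, repeated so the sketch is self-contained.
urgency_keywords = {
    "high": ["urgent", "immediate", "asap", "right away", "critical", "emergency"],
    "medium": ["soon", "shortly", "in a while", "important", "necessary"],
    "low": ["whenever", "later", "in the future", "eventually"],
}


def explain_urgency(text):
    """Return (level, keyword) for the first keyword found, or (None, None)."""
    lowered = text.lower()
    # Levels are checked in dict order: high, then medium, then low.
    for level, keywords in urgency_keywords.items():
        for keyword in keywords:
            if keyword in lowered:
                return level, keyword
    return None, None


# Made-up sentences for illustration (not taken from the dataset).
print(explain_urgency("Two hours later we finally boarded."))
print(explain_urgency("The crew handled the medical emergency well."))
```

Reading through a sample of `(level, keyword)` pairs is one way to judge whether a keyword list is capturing real urgency or just common phrasing.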

In [9]:
reviews_df.shape

(500, 4)

In [10]:
reviews_df.isna().sum()

review          0
sentiment       0
categories      0
urgency       412

Only 88 of the 500 reviews (51 low, 28 medium, 9 high) matched an urgency keyword; the remaining 412 have no urgency label.
dtype: int64