# Airline Tweets Analysis


In this section, we will go through an example analysis of tweets about airlines. We will cover loading data, manipulating dataframes, and statistical analysis and visualization.

## Introducing the Dataset

- The dataset is from the [Airline tweets sentiment dataset](https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment?resource=download), which uses tweets that tag one of several major airlines. The dataset also includes information about the tweet location and time, the airline mentioned, and the sentiment of the tweet.


In [None]:
import numpy as np
import pandas as pd
import os


## 1. Import data:

Skills to include:
- import multiple files in a directory (with some simple parsing functions)
- combine them into a single dataframe (`pd.concat`)
- write a function called `process_file`

## 1.1 Load in a single file
First, let's load in a single file and take a look at it. 

1. Read in the 'Delta.csv' file. What relative filepath do you need to pass to the function?
2. How many rows are there? How many columns?
3. Which columns seem most informative? Are there any extra or redundant columns? 
4. Where is airline represented in the csv file?



In [None]:
#load in file for Delta
single_airline = pd.read_csv(...)

It turns out that the airline is not recorded in any column, but it is in the name of the csv file. Let's extract that information and add it to the dataframe in a column called `airline`. Make sure that it is lower case, and remove the file extension.
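As a small illustration of the string handling involved, here is a sketch of how a filename can be split from its directory and extension with `os.path`. The path `some_folder/Example-Air.CSV` is a made-up value for demonstration and is not one of the workshop files.

```python
import os

# Hypothetical filepath, used only for illustration.
example_path = os.path.join("some_folder", "Example-Air.CSV")

# basename drops the directory part of the path.
filename = os.path.basename(example_path)

# splitext separates the name from its extension.
stem, extension = os.path.splitext(filename)

# lower() returns a lower-case copy of the string.
print(stem.lower())
```

The same pattern should carry over to the real file, but check the intermediate values yourself before assigning the result to the new column.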

In [None]:
## solution
name =  ...
single_airline['airline'] = ...

Now let's make a function `process_file(filepath)` that loads in a file with a filepath and returns the dataframe with the airline column added.

In [None]:
def process_file(filepath):
    df = pd.read_csv(filepath)
    ...
    return df


Here's another filepath: `'data/US-Airways.csv'` What will be in the airline column in the output? 


In [None]:
process_file('data/US-Airways.csv')

Let's modify our function to make sure that multi-word airlines have a space rather than a hyphen between words.
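One string method that may help here is `str.replace`, which returns a new string with every occurrence of one substring swapped for another. The sketch below uses a made-up name rather than one from the dataset.

```python
# Hypothetical value, used only for illustration.
name = "example-air-lines"

# replace() swaps every hyphen for a space and returns a new string;
# the original string is left unchanged.
spaced = name.replace("-", " ")

print(spaced)
```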

In [None]:
def process_file(filepath):
    ...
    return df


## 1.2 Load in multiple files

Now that we have a function, let's iterate through all of the files in the directory. First, loop through and print every file in the `airline_data` directory.


In [None]:
subdir = 'airline_data'
#your code here

We notice that there is a file in the directory that is not a CSV, so it can't be read into a pandas dataframe. This will cause an error in the dataframe processing, so let's use an if statement to skip any file that does not have the `.csv` extension. Modify your code from the cell above.
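Here is a minimal sketch of filtering a directory listing by extension. To stay self-contained it builds a temporary directory with made-up filenames (`a.csv`, `b.csv`, `notes.md`) rather than reading `airline_data`.

```python
import os
import tempfile

# Build a throwaway directory with hypothetical files for the demo.
with tempfile.TemporaryDirectory() as tmpdir:
    for fname in ["a.csv", "b.csv", "notes.md"]:
        with open(os.path.join(tmpdir, fname), "w") as f:
            f.write("")

    csv_files = []
    # os.listdir returns bare filenames in arbitrary order, so sort them.
    for fname in sorted(os.listdir(tmpdir)):
        # Keep only the names that end with the extension we want.
        if fname.endswith(".csv"):
            csv_files.append(fname)

print(csv_files)
```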

In [None]:
## your code here

Now we have all of the right files to process. Let's build a script that processes each file using the function from above and accumulates the results in a list of dataframes. As a first step, substitute the `process_file` function from above for the print line. Running this raises an error. What is the error? How do we resolve it?


In [None]:
#your code here

Look up the function `os.path.join()` (and recall our File I/O notebook). How can this help build the filepath dynamically? What might be the advantage of this method over string concatenation?
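To make the comparison concrete, here is a short sketch with made-up directory and file names. It shows what plain concatenation produces when the separator is forgotten, and what `os.path.join` produces.

```python
import os

# Hypothetical names, used only for illustration.
directory = "some_folder"
filename = "example.csv"

# Plain concatenation silently drops the separator.
concatenated = directory + filename

# os.path.join inserts the separator for the current operating system.
joined = os.path.join(directory, filename)

# It also avoids doubling the separator if the directory already ends with one.
joined_trailing = os.path.join(directory + os.sep, filename)

print(concatenated)
print(joined)
```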

Now, let's update the for-loop to resolve the error from above. 

In [None]:
## your code here


Finally, let's aggregate these into a list and concatenate them into a whole dataframe.
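As a reminder of how `pd.concat` behaves, here is a sketch on two tiny made-up dataframes. The column names and values are placeholders, and `ignore_index=True` is one optional choice that renumbers the rows of the result.

```python
import pandas as pd

# Two small hypothetical dataframes with the same columns.
frames = [
    pd.DataFrame({"text": ["a", "b"], "airline": "one"}),
    pd.DataFrame({"text": ["c"], "airline": "two"}),
]

# Stack the dataframes row-wise; ignore_index=True renumbers rows from 0.
combined = pd.concat(frames, ignore_index=True)

print(combined.shape)
```

Without `ignore_index=True`, each piece keeps its own row labels, so the combined index would contain repeated values.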

In [None]:
## solution

dflist = []
directory = 'airline_data'

##your loop here

df = ___.____(dflist)
    

Now let's take a look at the combined dataframe.

1. How many rows and columns are there in the total dataframe?
2. How many unique airlines are in the dataset?
3. How many numeric columns are there in the dataset?

## 2. Data processing


## 2.1 Nulls

First, let's summarize the null values in the dataset. Which columns have null values in them? What are some ways that we could deal with them?

**Hint**: `pd.isnull()` may be a good place to start. 
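For example, on a small made-up dataframe, null values can be counted per column by chaining `isnull()` with `sum()`. The column names `a`, `b`, and `c` are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with some missing values.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", None, None],
    "c": [1, 2, 3],
})

# isnull() gives a True/False table; summing counts the True values per column.
null_counts = toy.isnull().sum()

# Keep only the names of columns with at least one null.
columns_with_nulls = null_counts[null_counts > 0].index.tolist()

print(null_counts)
print(columns_with_nulls)
```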

In [None]:
##your code here

We won't be using any of the columns with null values in the analysis, so we don't need to drop any rows from this dataset. 

Let's drop the columns: `Unnamed: 0`, `tweet_coord`, `tweet_id`, `user_timezone`,
`tweet_created`, `tweet_location`, `negativereason_gold`,
`airline_sentiment_gold`

This will make the dataset more manageable for further analysis.
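As a sketch of the mechanics, `DataFrame.drop` with the `columns` argument returns a new dataframe without the listed columns. The dataframe and column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical dataframe with two columns we do not need.
toy = pd.DataFrame({"keep_me": [1, 2], "extra_1": [3, 4], "extra_2": [5, 6]})

# drop() returns a new dataframe; toy itself is unchanged unless reassigned.
trimmed = toy.drop(columns=["extra_1", "extra_2"])

print(list(trimmed.columns))
```

Because `drop` does not modify the dataframe in place by default, remember to assign the result back to a variable.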

In [None]:
columns_to_drop = ['Unnamed: 0','tweet_coord','tweet_id','user_timezone',
         'tweet_created','tweet_location','negativereason_gold',
        'airline_sentiment_gold']
list(df)
#your code here
df.____

## 2.2 Feature extraction

Now let's do some basic preprocessing on the data. First, let's look at the first few rows of the dataframe. 


In [None]:
df.head(3)

Let's extract a few simple features from the text data, such as the number of words. Let's make four new columns:
1. `word_count`: number of words in each tweet
2. `hashtags` : count the number of '#' symbols
3. `mentions` : count number of '@' symbols
4. one other text feature (your choice): for example number of capital words, links, or punctuation like '!', '?', etc. 
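The pandas `.str` accessor applies string methods to every row of a column. Here is a sketch on two made-up tweets; the `exclamations` column is just one hypothetical choice for the extra feature.

```python
import pandas as pd

# Two hypothetical tweets, used only for illustration.
toy = pd.DataFrame({
    "text": [
        "@air thanks for the #great flight",
        "Delayed again #late #tired",
    ]
})

# Split each tweet on whitespace, then count the pieces.
toy["word_count"] = toy["text"].str.split().str.len()

# str.count treats its argument as a regular expression.
toy["hashtags"] = toy["text"].str.count("#")
toy["exclamations"] = toy["text"].str.count("!")

print(toy)
```

Since `str.count` takes a regular expression, characters with a special meaning such as `?` need to be escaped (for example `r"\?"`).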


In [None]:
df['word_count'] = ...
df['hashtags'] = ...
df['mentions'] = ...

# final one your choice

Next steps in text preprocessing would often use tokenization or vectorization on tweets, to convert the words themselves to numerical data for preprocessing. If you are interested, check out the Python Text Analysis workshop! 


## 2.3 Subset tweets

How many sentiment types are there in the dataframe? 

For our exploratory analysis, let's start by looking just at positive/negative tweets.

1. Subset the dataframe
2. What proportion of the tweets have a positive sentiment?


What is the condition that we would use to subset the dataframe? Subset the dataframe for non-neutral tweets and save it to a dataframe called `pos_neg_df`

In [None]:
pos_neg_df = ...

The `airline_sentiment` column has the terms `positive` and `negative` in it. Let's change them to a numerical column, where 1 = positive, and 0 = negative. One way to do it is to select a subset of the dataframe, then assign a new value to that subset.
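The general pattern is `df.loc[row_condition, column] = value`. The sketch below shows it on a made-up dataframe and, to keep the demo simple, writes into a separate hypothetical column `label_num` instead of overwriting the text column. Calling `.copy()` on the subset is an optional step that avoids pandas warnings about assigning to a slice of another dataframe.

```python
import pandas as pd

# Hypothetical dataframe, used only for illustration.
toy = pd.DataFrame({"label": ["yes", "no", "yes"], "value": [1, 2, 3]})

# Take an explicit copy of the subset before modifying it.
subset = toy[toy["value"] > 0].copy()

# Start with 0 everywhere, then set 1 on the rows matching the condition.
subset["label_num"] = 0
subset.loc[subset["label"] == "yes", "label_num"] = 1

print(subset)
```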

In [None]:
##
print(pos_neg_df['airline_sentiment'].unique())
pos_neg_df.loc[...,...] = ...
pos_neg_df.loc[...,...] = ...
print(pos_neg_df['airline_sentiment'].unique())


## 3. Exploratory analysis

## 3.1 Most common users, most frequent airlines

Let's look at the users tweeting at the airlines. 

1. How many unique users are there in the dataset? 
2. Who tweeted the most about airlines in this dataset? (**Hint**: consider calling `value_counts()` on the column)
3. Choose one of the users with the top five most tweets. Which airline are they tweeting about?

**Hint**: Users are recorded in the `name` column
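As a sketch of the two methods involved, `nunique()` counts distinct values and `value_counts()` tallies how often each value appears, sorted from most to least frequent. The names and airlines below are made up.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
toy = pd.DataFrame({
    "name": ["ann", "bob", "ann", "cy", "ann", "bob"],
    "airline": ["one", "two", "one", "two", "two", "two"],
})

# Number of distinct users.
n_users = toy["name"].nunique()

# Tweets per user, most frequent first.
counts = toy["name"].value_counts()
top_user = counts.index[0]

# Which airlines does the most frequent user tweet about?
top_user_airlines = toy.loc[toy["name"] == top_user, "airline"].value_counts()

print(n_users, top_user)
print(top_user_airlines)
```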


In [None]:
## your code here


This format doesn't give a great idea of the overall distribution of the data. Let's plot this data in a histogram using the pandas `.plot()` method. How would you add a title and x-axis label to the plot?
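One possible approach, sketched on made-up counts: the pandas plotting method returns a matplotlib `Axes` object, and the title and axis labels can be set on it. The `matplotlib.use("Agg")` line is only there so the sketch runs without a display and can be left out in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Hypothetical tweet counts per user.
counts = pd.Series([1, 1, 2, 1, 5, 3, 1, 2])

# plot() returns a matplotlib Axes that can be customized afterwards.
ax = counts.plot(kind="hist", bins=5)
ax.set_title("Tweets per user (toy data)")
ax.set_xlabel("Number of tweets")
```

The `title` argument of `plot()` is another way to set the title in a single call.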

In [None]:
## your code here


## 3.2 Most common negative reasons for tweets

Now let's look at the `negativereason` column. For negative tweets, this column summarizes the topic the user is tweeting about.

1. How many tweets are about each reason? Sort these from lowest frequency to highest frequency. Which reason is the most common, and which is the least?
2. Make a bar plot from the frequency counts. Add a title and a y-label. 

**Bonus**: Add additional customizations to the plot
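A sketch of the same idea on made-up categories: count each value, sort the counts from lowest to highest, and draw a bar plot. The category labels are placeholders, and the `Agg` backend line is only needed when running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Hypothetical categories, used only for illustration.
reasons = pd.Series(["late", "bags", "late", "rude", "late", "bags"])

# Count each category, then sort from lowest to highest frequency.
reason_counts = reasons.value_counts().sort_values(ascending=True)

# Bar plot of the sorted counts with a title and y-axis label.
ax = reason_counts.plot(kind="bar", title="Reasons (toy data)")
ax.set_ylabel("Number of tweets")

print(reason_counts)
```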


In [None]:
## your code here.

## 3.3 Are negative tweets longer than positive tweets?

Let's take a look at the negative and positive tweets. We are interested in the question of whether negative tweets are longer than positive tweets. Let's test this with a t-test.

1. Subset the data into positive and negative tweets
2. Select the `word_count` column from each subset
3. Calculate the mean word count for each group. Which mean is higher?
4. Use a t-test to compare the two sets of values from (2). What is the p-value of the result?
5. Plot a histogram layer for both positive and negative tweet word counts. What do you notice about the distribution?

**Hint**: Refer to the statsmodels notebook from Day 3 for an example!

In [None]:
import statsmodels.api as sm

#subset dataframe

#ttest
res = sm.stats.ttest_ind(...)


#plot (kind = 'hist')


## 3.4 Linear regression of tweet length

Let's use a linear regression to look at other predictors of tweet length. 
Steps:
1. Select the numeric columns from the dataframe (except `word_count`) and save them as `X`
2. Select the `word_count` column and save it as `y`
3. Set up a linear regression and fit it to the data
4. Interpret the model summary


**Bonus**: How many lines of code did it take? Can you shorten it?

In [None]:
## solution

X = ...
y = ...
model=...

model.summary()

## Next Steps

This notebook took us through importing multiple csv files, data manipulation, and some basic visualizations and analysis of data. If you were working on this dataset, what would you potentially do next? It could be either an analysis, a new feature to include, a visualization that might help represent the data, etc.