# Airline Tweets Analysis


In this section, we will work through an example analysis of tweets about airlines. We will bring together the basic programming, data loading, and statistical analysis and visualization techniques from Parts 1-3 of this workshop.

## Introducing the Dataset

The dataset is from the [Airline tweets sentiment dataset](https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment?resource=download), which contains tweets that tag one of several major airlines. The dataset also includes information about the tweet location and time, the airline mentioned, and the sentiment of the tweet.


First, let's import the packages to use in this analysis:


In [None]:
import numpy as np
import pandas as pd
import os
import statsmodels.api as sm


## 1. Import data:

First, let's import our dataset. The data are located in the `airline_data` subfolder of this directory. Let's see what is in that subfolder using `os.listdir()`.

In [None]:
os.listdir('airline_data')

## 1.1 Load in a single file
First, let's load in a single file and take a look at it. 

1. Read in the `Delta.csv` file as a `pandas` object.
2. How many rows are there? How many columns?
3. Which columns seem most informative? Are there any extra or redundant columns? 
4. Where is airline represented in the csv file?

In [None]:
## solution
single_airline = pd.read_csv('airline_data/Delta.csv')

single_airline.head()

It turns out that the airline is not recorded in any column; it appears only in the name of the CSV file. Let's extract that information and add it to the dataframe in a column called `airline`.

In [None]:
## solution
# os.path.basename strips the directory part using the current OS's separator
filename = os.path.basename('airline_data/Delta.csv')
print(filename)
name = filename.split('.')[0]
print(name)
single_airline['airline'] = name
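
Another standard-library route to the same result is `pathlib`. The sketch below is an illustration rather than part of the workshop solution; the path is the same example file used above, and `Path.stem` returns the file name without its extension.

```python
from pathlib import Path

# Example path mirroring the folder layout used in this notebook
filepath = Path('airline_data') / 'Delta.csv'

print(filepath.name)   # file name including the extension
print(filepath.stem)   # file name without the extension

airline = filepath.stem
```

Because `Path` objects know about path separators, no manual splitting on `/` or `\` is needed.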

Now let's make a function `process_file(filepath)` that loads the file at a given filepath and returns the dataframe with the airline column added. Most of the steps are already done in the cell above; now we just need to wrap them in a function.

In [None]:
## solution
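def process_file(filepath):
    """Load one airline CSV and add an `airline` column taken from the file name."""
    df = pd.read_csv(filepath)
    # basename/splitext avoid hard-coding a path separator
    filename = os.path.basename(filepath)
    df['airline'] = os.path.splitext(filename)[0].lower()

    return df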


Here's another filepath: `'airline_data/US-Airways.csv'`. What will be in the airline column in the output?


In [None]:
process_file(os.path.join('airline_data', 'US-Airways.csv'))

In the future we may want to modify our `process_file` function to make sure that multi-word airlines have a space rather than a hyphen between words, but for now we will move forward in the analysis.
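
As a rough sketch of what that modification could look like, the hypothetical helper `clean_airline_name` below (not part of the workshop code) replaces hyphens with spaces after stripping the directory and extension. It only handles the file name; reading the CSV would stay in `process_file`.

```python
import os

def clean_airline_name(filepath):
    """Hypothetical helper: turn a CSV path into a lowercase airline name."""
    stem = os.path.splitext(os.path.basename(filepath))[0]
    # Multi-word airlines are hyphenated in the file names
    return stem.replace('-', ' ').lower()

print(clean_airline_name('airline_data/US-Airways.csv'))
print(clean_airline_name('airline_data/Delta.csv'))
```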

## 1.2 Load in multiple files

Now that we have a function, let's iterate through all of the files in the directory.

First, fill in the blanks below to loop through and print every file in the `airline_data` directory.

In [None]:
## solution
directory = 'airline_data'
for file in os.listdir(directory):
    print(file)

We notice that there is a `.txt` file in the directory, which is not one of the CSV data files. Trying to process it as a dataframe will cause an error, so let's use an `if` statement to keep only files with the `.csv` extension.


First, let's write an expression that evaluates to `True` for `.csv` files and `False` for `.txt` files.

In [None]:
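test_csv = 'delta.csv'  # expression should evaluate to True
test_txt = 'delta.txt'  # expression should evaluate to False

# print both results; a notebook cell only displays its last expression
print(test_csv.endswith('.csv'))
print(test_txt.endswith('.csv'))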
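
There is more than one way to write this check, and the options behave differently on unusual file names. The sketch below compares splitting on `'.'` with `str.endswith`; the names `US.Airways.csv` and `README` are made up for illustration and are not files in `airline_data`.

```python
filenames = ['Delta.csv', 'US.Airways.csv', 'notes.txt', 'README']

def is_csv_by_split(name):
    """Check the second dot-separated piece, guarding against names with no dot."""
    parts = name.split('.')
    return len(parts) > 1 and parts[1] == 'csv'

def is_csv_by_suffix(name):
    """Check only the end of the name."""
    return name.endswith('.csv')

for name in filenames:
    print(name, is_csv_by_split(name), is_csv_by_suffix(name))
```

The split-based check misses a CSV file whose name contains an extra dot, and without the length guard it would raise an `IndexError` on a name with no dot at all. The suffix check handles both cases.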

In [None]:
## solution
directory = 'airline_data'
for file in os.listdir(directory):
    if file.endswith('.csv'):
        print(file)

Now we have all of the right files to process. Let's build a loop that processes each file using the function from above and accumulates the results in a list of dataframes. As a first step, I substitute in the `process_file` function from above. That results in the error below. What is the error? How might we resolve it?

In [None]:
directory = 'airline_data'
for file in os.listdir(directory):
    if file.endswith('.csv'):
        process_file(file)

Look up the function `os.path.join()` (and recall our File I/O notebook). How can this help build the filepath dynamically? What might be the advantage of this method over string concatenation?
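
To make the comparison concrete, here is a small sketch using the directory and file names from this notebook. It only builds strings, so it runs even if the files are not present.

```python
import os

directory = 'airline_data'
filename = 'Delta.csv'

# String concatenation: the separator is hard-coded by hand
concatenated = directory + '/' + filename

# os.path.join: the separator is chosen for the current operating system
joined = os.path.join(directory, filename)

# os.path.join also avoids a doubled separator when the directory
# already ends with one
with_trailing = os.path.join(directory + os.sep, filename)

print(concatenated)
print(joined)
print(with_trailing == joined)
```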

Update our loop to resolve the error from above.


In [None]:
## solution

directory = 'airline_data'
for file in os.listdir(directory):
    if file.endswith('.csv'):
        process_file(os.path.join(directory,file))

Finally, we want to take all of the airline DataFrames and put them together into a single `DataFrame`. 

*Without coding* (for now), write out the steps for the file processing code, including the aggregation steps described above. The first couple of steps are filled out for you:

1. Get a list of files in a directory
2. For each file in the directory
    1. If the file ends in csv
    2. Join the file and directory name and process file
    3. Append the dataframe to a list
3. Concatenate the list into a single dataframe

Now let's use these steps to aggregate these into a list and concatenate them into a whole dataframe.

In [None]:
## solution

dflist = []
directory = 'airline_data'
for file in os.listdir(directory):
    if file.endswith('.csv'):
        full_path = os.path.join(directory, file)
        print(full_path)
        dflist.append(process_file(full_path))

# ignore_index=True gives the combined dataframe a fresh 0..n-1 index
df = pd.concat(dflist, ignore_index=True)

Finally, let's take a look at the final data frame.

1. How many rows and columns are there in the total dataframe?
2. How many unique airlines are in the dataset?
3. How many numeric columns are there in the dataset?

In [None]:
## solution
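# print intermediate results; a notebook cell only displays its last expression
print(df.shape)                 # (rows, columns)

print(df['airline'].nunique())  # number of unique airlines

df.head()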
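
For the question about numeric columns, one possible approach is `DataFrame.select_dtypes`. The sketch below uses a small made-up dataframe (the values are invented; the column names are borrowed from the dataset) so that it runs on its own.

```python
import pandas as pd

# Toy stand-in for the airline dataframe; values are made up
toy = pd.DataFrame({
    'text': ['@airline late again', '@airline thanks!'],
    'retweet_count': [0, 2],
    'airline_sentiment_confidence': [1.0, 0.65],
})

# Keep only columns with a numeric dtype (integers and floats)
numeric_columns = toy.select_dtypes(include='number').columns
print(list(numeric_columns))
print(len(numeric_columns))
```

On the real data, the same call on `df` would list its numeric columns.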

## 2. Data processing


## 2.1 Nulls

First, let's summarize the null values in the dataset. Which columns have null values in them? What are some ways that we could deal with them?

**Hint**: `pd.isnull()` may be a good place to start. 

In [None]:
##solution
df.isnull().sum(axis=0)

We won't be using any of the columns with null values in the analysis, so we don't need to drop any rows from this dataset. 
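
For reference, here is a sketch of a few common ways to handle nulls in `pandas`, shown on a made-up three-row dataframe. The column names echo the dataset, but the values and the fill text `'none given'` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: one partly-null column and one entirely-null column
toy = pd.DataFrame({
    'text': ['late flight', 'great crew', 'lost bag'],
    'negativereason': ['Late Flight', np.nan, 'Lost Luggage'],
    'tweet_coord': [np.nan, np.nan, np.nan],
})

null_counts = toy.isnull().sum()

# Option 1: drop every row that contains any null value
rows_without_nulls = toy.dropna()

# Option 2: drop columns that are entirely null
without_empty_cols = toy.dropna(axis=1, how='all')

# Option 3: fill nulls in a chosen column with a placeholder
filled = toy.fillna({'negativereason': 'none given'})

print(null_counts)
print(len(rows_without_nulls), list(without_empty_cols.columns))
```

Note that Option 1 removes every row here, because each row has a null in `tweet_coord`. This is why it is worth checking which columns contain nulls before dropping rows.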

Let's drop the columns that we won't use: `Unnamed: 0`, `tweet_id`, `tweet_coord`, `tweet_location`, `tweet_created`, `user_timezone`, `negativereason_gold`, and `airline_sentiment_gold`.

This will make the dataset more manageable for further analysis.

In [None]:
##solution
columns_to_drop = ['Unnamed: 0', 'tweet_coord', 'tweet_id', 'user_timezone',
                   'tweet_created', 'tweet_location', 'negativereason_gold',
                   'airline_sentiment_gold']
df.drop(columns_to_drop, axis=1, inplace=True)

## 2.2 Feature extraction

Now let's do some basic preprocessing on the data. First, let's look at the first few rows of the dataframe. 


In [None]:
df.head(3)

Let's extract a few simple features from the text data. Let's make three new columns:
1. `word_count`: number of words in each tweet
2. `mentions` : count number of '@' symbols
3. one other text feature (your choice): for example number of capital words, links, or punctuation like '!', '?', etc. 


**Hint:** Remember that you can use `Series.str` to access vectorized string functions!


In [None]:

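# split() with no argument splits on any run of whitespace
df['word_count'] = df['text'].str.split().str.len()

df['mentions'] = df['text'].str.count('@')

# final one is your choice; counting '#' symbols gives the `hashtags`
# column that the regression in section 3.3 uses
df['hashtags'] = df['text'].str.count('#')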
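
For the free-choice feature, here are a few possibilities sketched on two made-up tweets. The regular expressions are one reasonable choice among several, not the workshop's official solution.

```python
import pandas as pd

# Two invented tweets for illustration
tweets = pd.Series([
    '@airline WORST flight ever!!! http://example.com',
    '@airline thanks for the upgrade :)',
])

# Number of exclamation marks
exclamations = tweets.str.count('!')

# Number of links, matching http:// or https://
links = tweets.str.count(r'https?://')

# Number of fully capitalized words of at least two letters
capital_words = tweets.str.findall(r'\b[A-Z]{2,}\b').str.len()

print(pd.DataFrame({
    'exclamations': exclamations,
    'links': links,
    'capital_words': capital_words,
}))
```

Each of these could be assigned to a new column of `df` in the same way as `word_count` and `mentions`.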

The next steps in text preprocessing often involve tokenizing or vectorizing the tweets to convert the words themselves into numerical data. If you are interested, check out the Python Text Analysis workshop!


## 2.3 Subset Tweets

**Question:** How many sentiment types are there in the DataFrame? 

For our exploratory analysis, let's start by looking just at positive/negative tweets.

1. Subset the dataframe
2. What proportion of the tweets have a positive sentiment?

What is the condition that we would use to subset the dataframe? Subset the dataframe for non-neutral tweets and save it to a dataframe called `pos_neg_df`.

**Hint:** You can use `!=` to check for all values not equal to a certain value.

In [None]:
##solution

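print(df['airline_sentiment'].unique())
# .copy() makes pos_neg_df independent of df, so columns can be modified safely later
pos_neg_df = df.loc[(df['airline_sentiment'] != 'neutral'), :].copy()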

The `airline_sentiment` column has the terms `positive` and `negative` in it. Let's change them to a numerical column, where 1 = positive, and 0 = negative.
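
One way to do this recoding is `Series.map` with a dictionary. The sketch below uses a made-up four-row dataframe so that it runs on its own; `toy` and `pos_neg_toy` are stand-ins for `df` and `pos_neg_df`.

```python
import pandas as pd

# Toy stand-in for the tweets dataframe; rows are invented
toy = pd.DataFrame({
    'airline_sentiment': ['negative', 'positive', 'neutral', 'negative'],
    'text': ['late', 'great', 'ok', 'lost bag'],
})

# Keep only non-neutral tweets
pos_neg_toy = toy.loc[toy['airline_sentiment'] != 'neutral'].copy()

# Recode: 1 = positive, 0 = negative
pos_neg_toy['airline_sentiment'] = pos_neg_toy['airline_sentiment'].map(
    {'positive': 1, 'negative': 0}
)

# With a 0/1 column, the mean is the proportion of positive tweets
proportion_positive = pos_neg_toy['airline_sentiment'].mean()
print(proportion_positive)
```

Applied to `pos_neg_df`, the same `map` call would produce the 0/1 column that the later cells compare against, and the mean would answer the question about the proportion of positive tweets.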

## 3. Exploratory analysis

## 3.1 Most common users, most frequent airlines

Let's look at the users tweeting at the airlines. 

1. How many unique users are there in the dataset? 
2. Who tweeted the most about airlines in this dataset? (**Hint**: consider df.value_counts())
3. Choose one of the users with the top five most tweets. Which airline are they tweeting about?

**Hint**: Users are recorded in the `name` column


In [None]:
## solution

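print(pos_neg_df['name'].nunique())   # number of unique users
users = pos_neg_df.value_counts('name')
print(users.head())                   # the five most active users
# airlines that one of the top users tweets about
pos_neg_df.loc[pos_neg_df['name'] == 'otisday', 'airline'].unique()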

This format doesn't give a great idea of the overall distribution of the data. Let's plot the number of tweets per user in a histogram using the `.plot()` method of a `pandas` Series. How would we add a title and x-axis label to the plot?

In [None]:
## solution
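# .plot() returns a matplotlib Axes, which we can use to label the x-axis
ax = df['name'].value_counts().plot(kind='hist', title='Tweets per user')
ax.set_xlabel('Number of tweets')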

In [None]:
##solution
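# Assumes airline_sentiment has been recoded to 0 (negative) and 1 (positive)
neg_words = pos_neg_df.loc[pos_neg_df['airline_sentiment'] == 0, 'word_count']
pos_words = pos_neg_df.loc[pos_neg_df['airline_sentiment'] == 1, 'word_count']

# ttest_ind returns the t statistic, the p-value, and the degrees of freedom
res = sm.stats.ttest_ind(neg_words, pos_words)
print(res)

# alpha makes the overlapping histograms visible through each other
neg_words.plot(kind='hist', alpha=0.5, label='negative', legend=True)
pos_words.plot(kind='hist', alpha=0.5, label='positive', legend=True)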

### Visualization

In [None]:
df.columns

In [None]:
df['word_count'].hist()

## 3.3 Linear Regression of Tweet Length

Let's use a linear regression to look at other predictors of tweet length. Complete the steps:

1. Select the numeric columns `airline_sentiment`, `airline_sentiment_confidence`, `retweet_count`, `hashtags`, and `mentions` (leaving out `word_count`), and save them as `X`
2. Select the `word_count` column and save it as `y`
3. Set up a linear regression and fit it to the data using `sm.OLS()`
4. Interpret the model summary

**Bonus**: How many lines of code did it take? Can you shorten it?

In [None]:
## solution
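# Keeping X as a dataframe preserves the column names in the summary table
X = pos_neg_df[['airline_sentiment','airline_sentiment_confidence','retweet_count','hashtags','mentions']].astype(float)
# sm.OLS does not fit an intercept unless a constant column is added
X = sm.add_constant(X)

y = pos_neg_df['word_count'].astype(float)

model = sm.OLS(y, X).fit()

model.summary()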
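
One detail worth knowing when interpreting the summary: `sm.OLS` fits an intercept only if the design matrix contains a constant column (for example, one added with `sm.add_constant`). The NumPy-only sketch below illustrates the effect on made-up data where the true relationship is `y = 5 + 2x`. It uses `np.linalg.lstsq` as a stand-in for the least-squares fit, not the `statsmodels` API.

```python
import numpy as np

# Made-up data following y = 5 + 2*x exactly
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 5.0 + 2.0 * x

# Design matrix without and with a column of ones
X_no_const = x.reshape(-1, 1)
X_with_const = np.column_stack([np.ones_like(x), x])

# Least-squares coefficients for each design matrix
slope_only, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)
intercept_and_slope, *_ = np.linalg.lstsq(X_with_const, y, rcond=None)

print(slope_only)            # slope when the line is forced through the origin
print(intercept_and_slope)   # intercept and slope
```

Without the constant column the fitted line is forced through the origin, so the slope is distorted to compensate for the missing intercept.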

## Bonus: Are negative tweets longer than positive tweets?

Let's take a look at the negative and positive tweets. We are interested in the question of whether negative tweets are longer than positive tweets. Let's test this with a t-test.

1. Subset the data into positive and negative tweets
2. Select the `word_count` column
3. Calculate the mean word count for each group. Which mean is higher?
4. Use a t-test to compare the two sets of values from step 2. What is the p-value of the result?
5. Plot a histogram layer for both positive and negative tweet word counts. Choose an appropriate value for `bins`. What do you notice about the distribution?

**Hint**: Refer to the statsmodels notebook from Day 3 for an example!

## Next Steps

This notebook took us through importing multiple csv files, data manipulation, and some basic visualizations and analysis of data. If you were working on this dataset, what would you potentially do next? It could be either an analysis, a new feature to include, a visualization that might help represent the data, etc.