# Project
Pick either classification or logistic regression (A for me). Whichever you choose, you need to write a short 200-to-500 word summary of your project along with the findings. 
Submit all of your content (code, data, words) as a GitHub repository. Your text should be written in markdown as a README.md.

# Classification

Obtain 1000+ things. You can get them via scraping, using an API, or even downloading a few large texts and using .split(".") to break them into sentences. Either text or numeric is fine.

If unlabeled, label at least 100 of them and write a classifier to label the rest.

If labeled, write a classifier to automatically classify them.

Try several classifiers to find the 'best' results according to accuracy score and confusion matrix.

Find the most important features.
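
The "try several classifiers" step can be sketched as a loop that fits each candidate on the same split and records its accuracy and confusion matrix. This is only an illustration: the sentences and labels below are made-up stand-in data, and the choice of three candidate models is an assumption, not a requirement of the assignment.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: a few labeled sentences, repeated to get enough rows.
texts = ["great flight and friendly crew", "lost my luggage again",
         "thank you for the upgrade", "delayed for three hours"] * 10
labels = [1, 0, 1, 0] * 10

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0)

# Candidate models to compare (an arbitrary selection for illustration).
candidates = {
    "random forest": RandomForestClassifier(random_state=0),
    "perceptron": Perceptron(random_state=0),
    "logistic regression": LogisticRegression(),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion": confusion_matrix(y_test, y_pred, labels=[0, 1]),
    }

# One accuracy per model, best first.
summary = pd.Series({name: r["accuracy"] for name, r in results.items()})
print(summary.sort_values(ascending=False))
```

Keeping the split fixed across models means any difference in the scores comes from the classifier rather than from which rows landed in the test set.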

In [1]:
import pandas as pd



In [2]:
df = pd.read_csv('2015_Airline_Tweets.csv')
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


## Determining which US airlines Twitter users mentioned most

In [3]:
df.airline.value_counts()

United            3822
US Airways        2913
American          2759
Southwest         2420
Delta             2222
Virgin America     504
Name: airline, dtype: int64

## Determining which sentiments Twitter users expressed most about US airlines

In [5]:
df.airline_sentiment.value_counts()

negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64

In [4]:
df['is_positive'] = (df.airline_sentiment == 'positive').astype(int)
df.is_positive.value_counts()

0    12277
1     2363
Name: is_positive, dtype: int64

In [20]:
airline_sentiment_df = df.airline_sentiment.dropna()
airline_sentiment_df.value_counts()

negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64

In [24]:
df.airline_sentiment.value_counts(dropna=False)

negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64

In [7]:
df.airline_sentiment = df.airline_sentiment.astype('U')

In [13]:
df.negativereason = df.negativereason.astype('string')

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Make a vectorizer
vectorizer = TfidfVectorizer()

# Learn TF-IDF weights for the values in df.airline_sentiment
# (the same column the is_positive target is derived from)
matrix = vectorizer.fit_transform(df.airline_sentiment)

# Convert the matrix of TF-IDF weights to a dataframe
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())
words_df

Unnamed: 0,negative,neutral,positive
0,0.0,1.0,0.0
1,0.0,0.0,1.0
2,0.0,1.0,0.0
3,1.0,0.0,0.0
4,1.0,0.0,0.0
...,...,...,...
14635,0.0,0.0,1.0
14636,1.0,0.0,0.0
14637,0.0,1.0,0.0
14638,1.0,0.0,0.0


In [22]:
X = words_df
y = df.is_positive

In [23]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
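
With only 2,363 positive tweets out of 14,640, a purely random split can leave the train and test sets with slightly different class shares, and it changes on every run. A minimal sketch of a reproducible, stratified split follows; the labels are made up to mimic the imbalance, and `test_size=0.25` and `random_state=42` are arbitrary choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels with roughly the same imbalance as is_positive (16% ones).
y = pd.Series([1] * 16 + [0] * 84, name="is_positive")
X = pd.DataFrame({"row_id": range(len(y))})

# stratify=y keeps the class shares equal in both halves;
# random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

print(y_train.mean(), y_test.mean())
```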

### First, a RandomForestClassifier

In [28]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

RandomForestClassifier()

In [29]:
clf.score(X_test, y_test)

1.0

In [31]:
df.is_positive.value_counts(normalize=True)

0    0.838593
1    0.161407
Name: is_positive, dtype: float64
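
The class shares above mean that a model predicting "not positive" for every tweet would already score about 0.84, so accuracy is best read against that baseline. A minimal sketch using scikit-learn's `DummyClassifier`; the labels below are made up to mimic the imbalance, not taken from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 84 non-positive rows and 16 positive rows.
y = np.array([0] * 84 + [1] * 16)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_accuracy = baseline.score(X, y)
print(baseline_accuracy)
```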

In [32]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

# confusion_matrix sorts the labels, so row/column 0 is is_positive == 0
label_names = pd.Series(['not positive', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not positive,Predicted positive
Is not positive,3059,0
Is positive,0,601
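
`confusion_matrix` orders its rows and columns by sorted label value, so for a 0/1 target the first row is always class 0. A short sketch that pins the order explicitly with `labels=[0, 1]` and adds precision and recall for the positive class; the true and predicted labels here are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (1 = positive, 0 = not positive).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# labels=[0, 1] fixes the order: row/column 0 is class 0, row/column 1 is class 1.
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
label_names = pd.Series(["not positive", "positive"])
table = pd.DataFrame(matrix,
                     columns="Predicted " + label_names,
                     index="Is " + label_names)
print(table)

# Precision and recall are reported for the positive class (label 1) by default.
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(precision, recall)
```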


### And then a Perceptron...

In [33]:
from sklearn.linear_model import Perceptron

clf = Perceptron(max_iter=4000)
clf.fit(X_train, y_train)

Perceptron(max_iter=4000)

In [34]:
clf.score(X_test, y_test)

1.0

In [35]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

# confusion_matrix sorts the labels, so row/column 0 is is_positive == 0
label_names = pd.Series(['not positive', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not positive,Predicted positive
Is not positive,3059,0
Is positive,0,601


### Peering into the negatives to find some more features

In [36]:
df['is_negative'] = (df.airline_sentiment == 'negative').astype(int)
df.is_negative.value_counts()

1    9178
0    5462
Name: is_negative, dtype: int64

In [41]:
negative_df = df[df.airline_sentiment.str.contains("negative", na=False)]
negative_df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,is_positive,is_negative
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),0,1
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),0,1
5,570300767074181121,negative,1.0,Can't Tell,0.6842,Virgin America,,jnardino,,0,@VirginAmerica seriously would pay $30 a fligh...,,2015-02-24 11:14:33 -0800,,Pacific Time (US & Canada),0,1
15,570282469121007616,negative,0.6842,Late Flight,0.3684,Virgin America,,smartwatermelon,,0,@VirginAmerica SFO-PDX schedule is still MIA.,,2015-02-24 10:01:50 -0800,"palo alto, ca",Pacific Time (US & Canada),0,1
17,570276917301137409,negative,1.0,Bad Flight,1.0,Virgin America,,heatherovieda,,0,@VirginAmerica I flew from NYC to SFO last we...,,2015-02-24 09:39:46 -0800,this place called NYC,Eastern Time (US & Canada),0,1


In [42]:
df['negativereason'].value_counts()

Customer Service Issue         2910
Late Flight                    1665
Can't Tell                     1190
Cancelled Flight                847
Lost Luggage                    724
Bad Flight                      580
Flight Booking Problems         529
Flight Attendant Complaints     481
longlines                       178
Damaged Luggage                  74
Name: negativereason, dtype: Int64

In [44]:
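# Call dropna() (with parentheses) and keep the result in its own Series;
# assigning the bare method back to the column would overwrite its values
negative_reasons = df['negativereason'].dropna()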

In [48]:
# Take an explicit copy first so adding a column does not raise SettingWithCopyWarning
negative_df = negative_df.copy()
negative_df['brutal_tweets'] = negative_df['text'].str.contains("bad")
negative_df.head()


Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,is_positive,is_negative,brutal_tweets
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),0,1,False
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),0,1,True
5,570300767074181121,negative,1.0,Can't Tell,0.6842,Virgin America,,jnardino,,0,@VirginAmerica seriously would pay $30 a fligh...,,2015-02-24 11:14:33 -0800,,Pacific Time (US & Canada),0,1,True
15,570282469121007616,negative,0.6842,Late Flight,0.3684,Virgin America,,smartwatermelon,,0,@VirginAmerica SFO-PDX schedule is still MIA.,,2015-02-24 10:01:50 -0800,"palo alto, ca",Pacific Time (US & Canada),0,1,False
17,570276917301137409,negative,1.0,Bad Flight,1.0,Virgin America,,heatherovieda,,0,@VirginAmerica I flew from NYC to SFO last we...,,2015-02-24 09:39:46 -0800,this place called NYC,Eastern Time (US & Canada),0,1,False
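
A plain `str.contains("bad")` is case-sensitive and also matches longer words that happen to contain those letters. If whole-word, case-insensitive matching is what is wanted (an assumption about the intent), a regular expression with word boundaries is one option. The four tweets below are invented.

```python
import pandas as pd

# Hypothetical tweets; only the `text` column name comes from the notebook.
sample = pd.DataFrame({"text": [
    "This is a really BAD experience",
    "Lost my badge at the gate",
    "not bad at all",
    "flight was fine",
]})

# \b marks word boundaries; case=False ignores capitalization.
sample["brutal_tweets"] = sample["text"].str.contains(
    r"\bbad\b", case=False, regex=True)
print(sample)
```

Note that "not bad at all" still matches, so a keyword flag like this is a rough filter rather than a sentiment measure.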


In [49]:
# Americans are really angry...
negative_df['text'].value_counts()

@AmericanAir why am I continually getting put on hold by painfully inexperienced people when calling your Platinum desk?!                         2
@AmericanAir it's not just frustrating--it was PAID for! how do we get a refund?                                                                  2
@AmericanAir this delayed bag was for my friend Lisa Pafe. She got her bag after 3 days in Costa Rica. Issue no updates on your system.           2
@AmericanAir if by near the gate you mean sitting on the plane for almost 2 hours, then yeah.                                                     2
@AmericanAir I'm frustrated by all of the @USAirways attitude toward #ExecPlat members. #thenewamerican                                           2
                                                                                                                                                 ..
@united plz don't advertise wifi if it's not gonna work thanks #worstflightever                                 
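
The counts of 2 above show that some tweets appear more than once. A small sketch of how repeated rows could be inspected and dropped with pandas; the three-row frame is hypothetical, and whether duplicates should be removed before training is a judgment call.

```python
import pandas as pd

# Hypothetical tweets with one exact repeat.
sample = pd.DataFrame({"text": [
    "@AmericanAir how do we get a refund?",
    "@united wifi is not working",
    "@AmericanAir how do we get a refund?",
]})

# keep=False marks every copy of a repeated tweet, not just the later ones.
repeated = sample[sample["text"].duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each distinct tweet.
deduplicated = sample.drop_duplicates(subset="text")
print(len(repeated), len(deduplicated))
```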

In [50]:
negative_df['text'].head()

3     @VirginAmerica it's really aggressive to blast...
4     @VirginAmerica and it's a really big bad thing...
5     @VirginAmerica seriously would pay $30 a fligh...
15        @VirginAmerica SFO-PDX schedule is still MIA.
17    @VirginAmerica  I flew from NYC to SFO last we...
Name: text, dtype: object