# Baseline model

This is the entry point for the competition. It:

* Reads the tweet data from CSV files
* Computes a Bag of Words (BoW) representation of the tweet text
* Tests several models to find out which performs best
* Predicts classes for the submission/benchmark tweets
* Generates a CSV file suitable for Kaggle InClass

## Data representation

The function `obtain_data_representation` fits the BoW transformation on the training set and applies it to both the train and test sets.

If no test set is provided, the input DataFrame is split into train and test sets holding 75% and 25% of the data, respectively. The held-out part makes it possible to compute an accuracy score, which is the evaluation metric on Kaggle.
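As a minimal sketch of the 75/25 split, the snippet below runs `train_test_split` on a toy DataFrame. The toy data and the `random_state` argument are assumptions added for illustration; the notebook itself splits the real tweets without fixing a seed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the tweets DataFrame (hypothetical data)
toy_df = pd.DataFrame({
    'text': ['tweet {}'.format(i) for i in range(8)],
    'airline_sentiment': ['negative', 'positive'] * 4,
})

# test_size=0.25 keeps 75% of the rows for training.
# random_state is an added assumption that makes the split reproducible.
train, test = train_test_split(toy_df, test_size=0.25, random_state=0)

print(len(train), len(test))  # 6 2
```

Every row ends up in exactly one of the two parts, so the accuracy is measured on tweets the model has not seen during training.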

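BoW is computed with the `CountVectorizer` class of `sklearn`. With `max_features=200` it is restricted to at most 200 features, which is the size of the matrices shown further down; the active call in the cell below uses unigrams and bigrams with English stop words and does not set this limit (the `max_features=200` variant is commented out). The `fit` method learns the vocabulary, whereas `transform` converts text to numerical vectors using the learnt features. Lastly, `fit_transform` performs the learning and the transformation in a single step.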
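The difference between `fit`, `transform` and `fit_transform` can be seen on a toy corpus. This is only an illustrative sketch: the example sentences and the tiny `max_features=2` limit are assumptions chosen to keep the output readable.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
train_texts = ['late flight', 'late flight crew', 'flight cancelled']
test_texts = ['late late crew', 'lost luggage']

# Keep only the 2 most frequent words of the training texts
cv = CountVectorizer(max_features=2)

# fit: learn the vocabulary from the training texts only
cv.fit(train_texts)
vocab = sorted(cv.vocabulary_)

# transform: count the learnt words in any text
x_train = cv.transform(train_texts).toarray()
x_test = cv.transform(test_texts).toarray()

# fit_transform: both steps at once, same result on the training texts
x_train_one_step = CountVectorizer(max_features=2).fit_transform(train_texts).toarray()

print(vocab)    # ['flight', 'late']
print(x_train)  # rows: [1 1], [1 1], [1 0]
print(x_test)   # rows: [0 2], [0 0]
```

Words that were not learnt from the training texts (`crew`, `lost`, `luggage`) are simply ignored when transforming the test texts.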

In [13]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split


def obtain_data_representation(df, test=None):
    # If there is no test data, split the input
    if test is None:
        # Divide data in train and test
        train, test = train_test_split(df, test_size=0.25)
        df.airline_sentiment = pd.Categorical(df.airline_sentiment)
    else:
        # Otherwise, all is train
        train = df
    
    # Custom stop words, used only by the commented-out alternative below
    mylist = ['the', 'a', 'got', 'of', 'is', 'are', 'we', 'you', 'I', 'me', 'them']
    # Create a Bag of Words (BoW), by using train data only
    # cv = CountVectorizer(max_features=200, stop_words=mylist)
    cv = CountVectorizer(analyzer='word', ngram_range=(1, 2), token_pattern=r'[^@]\b\w+\b', min_df=1, stop_words='english')

    x_train = cv.fit_transform(train['text'])
    
    y_train = train['airline_sentiment'].values
        
    #print(cv.vocabulary_)
    #print(cv.stop_words_)
    
    # Obtain BoW for the test data, using the previously fitted one
    x_test = cv.transform(test['text'])
    try:
        y_test = test['airline_sentiment'].values
    except KeyError:
        # It might be the submission file, where we don't have target values
        y_test = None
        
    return {
        'train': {
            'x': x_train,
            'y': y_train
        },
        'test': {
            'x': x_test,
            'y': y_test
        }
    }

## Model training

Though this function might seem strange at first, the only thing to know is that training an `sklearn` model is always done the same way:

```python
# 1. Create the model
model = BernoulliNB()

# 2. Train with some data, where `x` are features and
#    `y` is the target category
model.fit(x, y)

# 3. Predict new categories for test data (with which we
#    have not trained!)
y_pred = model.predict(test_x)
```

We can also obtain the accuracy score with the `accuracy_score` function.
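For reference, accuracy is the fraction of predictions that match the true labels. The short sketch below uses made-up labels (an assumption for illustration, not data from the competition) to show what `accuracy_score` returns.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for four tweets
y_true = ['negative', 'neutral', 'positive', 'negative']
y_pred = ['negative', 'negative', 'positive', 'negative']

# 3 of the 4 predictions are correct
score = accuracy_score(y_true, y_pred)
print(score)  # 0.75
```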

In [14]:
from sklearn.metrics import accuracy_score

def train_model(dataset, dmodel, *model_args, **model_kwargs):
    # Create the model (dmodel can be any sklearn estimator class)
    model = dmodel(*model_args, **model_kwargs)
    
    # Train it
    model.fit(dataset['train']['x'], dataset['train']['y'])
    
    # Predict new values for test
    y_pred = model.predict(dataset['test']['x'])
    
    # Print accuracy score unless it's the submission dataset
    if dataset['test']['y'] is not None:
        score = accuracy_score(dataset['test']['y'], y_pred)
        print("Model score is: {}".format(score))

    # Done
    return model, y_pred

In [15]:
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier


df = pd.read_csv('tweets_public.csv', index_col='tweet_id')
dataset = obtain_data_representation(df)

#print(df)
#print(dataset['train']['x'])

# Train a Bernoulli Naive Bayes
modelNB, _ = train_model(dataset, BernoulliNB)

# Train a K Nearest Neighbors Classifier
modelKN, _ = train_model(dataset, KNeighborsClassifier)

# Train a linear classifier with Stochastic Gradient Descent
modelSGD, _ = train_model(dataset, SGDClassifier)

Model score is: 0.6347905282331512
Model score is: 0.337431693989071
Model score is: 0.7786885245901639




In [9]:
dataset['train']

{'x': <6588x200 sparse matrix of type '<class 'numpy.int64'>'
 	with 34372 stored elements in Compressed Sparse Row format>,
 'y': array(['negative', 'negative', 'negative', ..., 'negative', 'negative',
        'neutral'], dtype=object)}

## Submit file

Once we have found the best model, we can train it on all the data (that is, without a train/test split) and predict sentiments for the real submission data. The cell below uses `BernoulliNB`, as in the simple baseline test, although `SGDClassifier` obtained the highest accuracy in the run shown above.

The cell below does exactly this.

In [10]:
import datetime

def create_submit_file(df_submission, ypred):
    date = datetime.datetime.now().strftime("%m_%d_%Y-%H_%M_%S")
    filename = 'submission_' + date + '.csv'
    
    df_submission['airline_sentiment'] = ypred
    df_submission[['airline_sentiment']].to_csv(filename)
    
    print('Submission file created: {}'.format(filename))
    print('Upload it to Kaggle InClass')

    
# Read submission and retrain with whole data
df_submission = pd.read_csv('tweets_submission.csv', index_col='tweet_id')
# We use df_submission as test, otherwise it would split df in train/test
submission_dataset = obtain_data_representation(df, df_submission)
# Predict for df_submission
_, y_pred = train_model(submission_dataset, BernoulliNB)

# Create submission file with obtained y_pred
create_submit_file(df_submission, y_pred)

{'jetblue': 100, 've': 185, 'know': 104, 'seat': 152, '30': 5, 'flying': 77, 'southwestair': 158, 'did': 53, 'dm': 56, 'hold': 88, 'just': 103, 'united': 180, 'flight': 69, 'home': 89, 'united flight': 181, 'lost': 115, 'great': 85, 'website': 193, 'said': 150, 'flights': 74, 'cancelled': 33, 'flighted': 71, 'cancelled flighted': 35, 'late': 105, 'usairways': 183, 'agents': 12, 'late flight': 106, 'm': 118, 'trying': 178, 'missed': 124, 't': 162, '2': 2, 'says': 151, 'number': 131, 'virginamerica': 186, 'http': 95, 'http t': 96, 'time': 169, 'getting': 81, 'flightr': 73, 'll': 113, 'late flightr': 107, 'americanair': 18, 'flightled': 72, 'phone': 137, 'don': 60, 'online': 132, 'flight cancelled': 70, 'cancelled flightled': 36, 'don t': 61, 'connection': 41, 'delays': 51, 'delayed': 50, 'customer': 43, 'service': 155, 'line': 112, 'people': 135, 'customer service': 44, 'got': 84, 'thanks': 165, 'plane': 138, 'airport': 17, 'help': 87, 'crew': 42, 'nice': 129, 'customers': 45, 'worst': 1

In [41]:
dataset = pd.read_csv('tweets_public.csv')
dataset.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,569237160886276096,negative,1.0,Can't Tell,0.6543,Delta,,venkatesh_cr,,0,@JetBlue I've been in pricing for 8 years to k...,,2015-02-21 12:48:09 -0800,Austin Texas,Central Time (US & Canada)
1,569267194028298241,negative,1.0,Customer Service Issue,1.0,Southwest,,ChristineFlores,,0,"@SouthwestAir AH - did DM, no reply. On hold n...",,2015-02-21 14:47:30 -0800,,Central Time (US & Canada)
2,569506670189137920,negative,0.6473,Lost Luggage,0.6473,United,,szymanski_t,,0,@united if you lost my belongings then BE HONEST!,,2015-02-22 06:39:05 -0800,,Eastern Time (US & Canada)
3,570293957739081728,negative,1.0,Customer Service Issue,1.0,United,,nate2482,,0,@United the internet is a great thing. I am e...,,2015-02-24 10:47:29 -0800,"Parkersburg, WV",Eastern Time (US & Canada)
4,570212129313316864,neutral,1.0,,,Delta,,elias_rubin,,0,@JetBlue I believe that the website said I cou...,,2015-02-24 05:22:20 -0800,"New York, NY",Pacific Time (US & Canada)


In [49]:
df = pd.read_csv('tweets_public.csv', index_col='tweet_id')
dataset = obtain_data_representation(df)

{' ': 0, '@': 38, 'u': 178, 's': 160, 'a': 42, 'i': 98, 'r': 152, 'w': 189, 'y': 196, '@u': 41, 'us': 184, 'sa': 162, 'ai': 46, 'ir': 105, 'rw': 159, 'wa': 191, 'ay': 54, 'ys': 199, 'j': 108, 'd': 66, 'b': 55, 'f': 82, 'l': 113, 'e': 71, ' @': 2, 'fl': 84, 'le': 117, 'es': 80, 's ': 161, "'": 26, ' w': 21, 'we': 192, 're': 155, 'e ': 72, 'x': 195, 'c': 60, 't': 167, ' e': 7, 'it': 107, 'te': 172, 'ed': 75, 'd ': 67, 'o': 139, ' t': 19, 'to': 175, 'o ': 140, 'h': 90, 'v': 186, ' h': 10, 'ha': 92, 'av': 53, 've': 187, ' y': 22, 'yo': 198, 'ou': 147, 'u ': 179, ' f': 8, 'ly': 122, 'y ': 197, 'wi': 194, 'th': 173, 'h ': 91, ',': 27, ' u': 20, ', ': 28, '!': 23, '! ': 24, 'n': 129, 'wh': 193, 'he': 93, 'en': 78, 'n ': 130, 'il': 102, 'll': 119, 'l ': 114, 'hi': 94, 'is': 106, '?': 36, ' b': 4, 'be': 57, 'p': 149, 'g': 86, ' s': 18, 'ri': 156, 'in': 103, 'ng': 135, 'g ': 87, 'k': 110, 'ea': 73, 'k ': 111, 'm': 123, '@a': 39, 'am': 48, 'me': 126, 'er': 79, 'ic': 100, 'ca': 61, 'an': 49, 'na':

In [50]:
dataset

{'test': {'x': <2196x200 sparse matrix of type '<class 'numpy.int64'>'
  	with 181830 stored elements in Compressed Sparse Row format>,
  'y': array(['negative', 'negative', 'negative', ..., 'negative', 'negative',
         'negative'], dtype=object)},
 'train': {'x': <6588x200 sparse matrix of type '<class 'numpy.int64'>'
  	with 545952 stored elements in Compressed Sparse Row format>,
  'y': array(['positive', 'neutral', 'negative', ..., 'negative', 'neutral',
         'negative'], dtype=object)}}

In [47]:
gp = dataset[dataset['airline_sentiment'] == 'positive']
gn = dataset[dataset['airline_sentiment'] == 'negative']

gp.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
14,569088825420607488,positive,1.0,,,US Airways,,WTFloris,,0,"@USAirways Got it, thanks!",,2015-02-21 02:58:43 -0800,Raxacoricofallapatorius,Amsterdam
21,567773445195583488,positive,1.0,,,Southwest,,MatthewJLeBlanc,,0,@SouthwestAir Thanks for the response. Was abl...,,2015-02-17 11:51:52 -0800,"Nashville, TN",Central Time (US & Canada)
22,567847753277120512,positive,1.0,,,United,,dudleywright,,0,@united thanks for not getting my BusinessFirs...,,2015-02-17 16:47:09 -0800,"Johnstown, Ohio",Quito
28,569891440135794688,positive,1.0,,,American,,heatherjpitcher,,0,@AmericanAir Thank You! CC: @packermama1,,2015-02-23 08:08:02 -0800,,Quito
29,567775418612195328,positive,0.6522,,,United,,ColtSTaylor,,0,@united looks like I'm settled in to where I'm...,"[39.85871934, -104.67371484]",2015-02-17 11:59:43 -0800,All Over The World,
