# 3 word weather: what weather state is most important to people?
Here the 3 word weather project, where people tweeted in 3 words to describe the weather will be processed to find out what about the weather people really want to know about.
Using the data from the first week of this project, the rest of the data will be filtered to contain the most commonly occuring words.
These words will then be sorted into categories, and it will then be ascertained in which order within the 3 word format certain categories are likely to appear.
Weights will be assigned depending on the words rate of occurence.
The order of the words will then be analysed based on their weights, and conclusions can be drawn on people's overall views of which weather states are most important.

Importing relevant libraries

In [1]:
import pandas as pd
import numpy as np
import csv
import ast

Importing csv data file as raw_data

In [2]:
raw_data = '/s3/three-word-weather/hack/3ww-all-raw.csv'

Converting to pandas dataframe

In [3]:
raw_df = pd.DataFrame.from_csv(path=raw_data, header=None)

In [4]:
raw_df.head(5)

Unnamed: 0_level_0,1,2
0,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-03-05 11:43:14,b'EdwardAndrews5',"b'#3wordweather cold, dry, windy. Ardersier'"
2018-03-05 10:09:54,b'ScoopLorimer',"b'#3wordweather\nSunny, calm, pleasant\n\nSout..."
2018-03-05 10:02:39,b'ScoopLorimer',"b'#3wordweather\nSunny, calm, pleasant'"
2018-03-04 17:29:42,b'BenThomWood',b'Whatever this means?! It\xe2\x80\x99s rainin...
2018-03-04 14:09:19,b'Miskmask',b'#3WordWeather @metoffice Warming to Rain #cr...


The data is in a binary format. The next function converts this data from binary, then we apply it to the columns in a binary format.

In [5]:
def _parse_bytes(field):
    result = field
    try:
        result = ast.literal_eval(field)
    finally:
        return result.decode() if isinstance(result, bytes) else field

In [6]:
raw_df[2] = raw_df[2].apply(_parse_bytes)
raw_df[1] = raw_df[1].apply(_parse_bytes)

In [7]:
raw_df.head(5)

Unnamed: 0_level_0,1,2
0,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-03-05 11:43:14,EdwardAndrews5,"#3wordweather cold, dry, windy. Ardersier"
2018-03-05 10:09:54,ScoopLorimer,"#3wordweather\nSunny, calm, pleasant\n\nSoutha..."
2018-03-05 10:02:39,ScoopLorimer,"#3wordweather\nSunny, calm, pleasant"
2018-03-04 17:29:42,BenThomWood,Whatever this means?! It’s raining here @Birch...
2018-03-04 14:09:19,Miskmask,#3WordWeather @metoffice Warming to Rain #crawley


There's some residue formatting leftover from binary (\n as spaces in some place), so we now replace this with spaces.

In [8]:
raw_df.replace({'\n': ' '}, inplace=True, regex=True)
raw_df.head(5)

Unnamed: 0_level_0,1,2
0,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-03-05 11:43:14,EdwardAndrews5,"#3wordweather cold, dry, windy. Ardersier"
2018-03-05 10:09:54,ScoopLorimer,"#3wordweather Sunny, calm, pleasant Southam, ..."
2018-03-05 10:02:39,ScoopLorimer,"#3wordweather Sunny, calm, pleasant"
2018-03-04 17:29:42,BenThomWood,Whatever this means?! It’s raining here @Birch...
2018-03-04 14:09:19,Miskmask,#3WordWeather @metoffice Warming to Rain #crawley


To analyse the words only text strings are required from the tweets, there all punctuation is taken out here.

In [220]:
raw_df.replace({'\#': ''}, inplace=True, regex=True)
raw_df.replace({'\?': ''}, inplace=True, regex=True)
raw_df.replace({'\.': ''}, inplace=True, regex=True)
raw_df.replace({'\:': ''}, inplace=True, regex=True)
raw_df.replace({'\,': ''}, inplace=True, regex=True)
raw_df.replace({'\!': ''}, inplace=True, regex=True)
raw_df.replace({'\-': ' '}, inplace=True, regex=True)
raw_df.head(5)

Unnamed: 0_level_0,1,2
0,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-03-05 11:43:14,EdwardAndrews5,3wordweather cold dry windy Ardersier
2018-03-05 10:09:54,ScoopLorimer,3wordweather Sunny calm pleasant Southam warks
2018-03-05 10:02:39,ScoopLorimer,3wordweather Sunny calm pleasant
2018-03-04 17:29:42,BenThomWood,Whatever this means It’s raining here @Birchmo...
2018-03-04 14:09:19,Miskmask,3WordWeather @metoffice Warming to Rain crawley


The data is inconsisten in being lower or upper case, therefore here it is all transposed to lower case to make it formattable.

In [221]:
raw_df = raw_df.apply(lambda x: x.astype(str).str.lower())
raw_df.head(5)

Unnamed: 0_level_0,1,2
0,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-03-05 11:43:14,edwardandrews5,3wordweather cold dry windy ardersier
2018-03-05 10:09:54,scooplorimer,3wordweather sunny calm pleasant southam warks
2018-03-05 10:02:39,scooplorimer,3wordweather sunny calm pleasant
2018-03-04 17:29:42,benthomwood,whatever this means it’s raining here @birchmo...
2018-03-04 14:09:19,miskmask,3wordweather @metoffice warming to rain crawley


In [222]:
raw_df.shape

(3442, 2)

Here some common extra words and phrases are removed, the relevant hashtag and 'and' that was often put between the final two words, again to make the format uniform.

In [223]:
raw_df.replace({'3wordweather': ''}, inplace=True, regex=True)
raw_df.replace({'threewordweather': ''}, inplace=True, regex=True)
raw_df.replace({'and': ''}, inplace=True, regex=True)
raw_df.shape

(3442, 2)

Here twitter names and web links are removed the body text of the tweet.

In [224]:
raw_df = raw_df.replace('\B(@[\w]+)', '', regex=True)
raw_df = raw_df.replace('\B(https//[\S\w]+)', '', regex=True)

In [225]:
#çome back to this bit to make removing web link work

In [226]:
raw_df.head(20)

Unnamed: 0_level_0,1,2
0,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-03-05 11:43:14,edwardrews5,cold dry windy ardersier
2018-03-05 10:09:54,scooplorimer,sunny calm pleasant southam warks
2018-03-05 10:02:39,scooplorimer,sunny calm pleasant
2018-03-04 17:29:42,benthomwood,whatever this means it’s raining here whatth...
2018-03-04 14:09:19,miskmask,warming to rain crawley
2018-03-04 13:03:17,charleesaunt,misty rain cold thirsk northyorkshire
2018-03-04 11:41:24,c130medic,snowing johnstone
2018-03-04 11:34:58,franceshipple1,grey soggy foggy dunstable https//tco/3dcc7a1wbo
2018-03-04 11:00:43,cranesbill1970,snowing white sky todmorden
2018-03-04 10:57:03,copsey_david,cold damp dull


Some tweets did not stick to the format of three nouns describing the weather and instead contained a phrase. Here any tweets containing linking words are dropped from the dataframe, as they are not consistent with the intended analysis.

In [227]:
raw_df = raw_df[raw_df[2].str.contains(" the ") == False]
raw_df = raw_df[raw_df[2].str.contains(" of ") == False]
raw_df = raw_df[raw_df[2].str.contains(" this ") == False]
raw_df = raw_df[raw_df[2].str.contains(" that ") == False]
raw_df = raw_df[raw_df[2].str.contains(" but ") == False]
raw_df = raw_df[raw_df[2].str.contains(" to ") == False]
raw_df = raw_df[raw_df[2].str.contains(" from ") == False]
raw_df = raw_df[raw_df[2].str.contains(" it ") == False]
raw_df = raw_df[raw_df[2].str.contains(" is ") == False]
raw_df = raw_df[raw_df[2].str.contains(" so ") == False]

In [228]:
raw_df.shape

(2852, 2)

Here a new column containing False is created, which will be reassigned to contain True if the tweet was a retweet. This is so that, although we can analyse all tweets with the assumption that any retweets are in exact agreement of opinion of the original social media user, date can also be analysed excluding retweets in case this assumption does not work.

In [229]:
raw_df = pd.concat([raw_df,pd.DataFrame(columns=([3]))])
raw_df[3] = False

In [230]:
raw_df.head(2)

Unnamed: 0,1,2,3
2018-03-05 11:43:14,edwardrews5,cold dry windy ardersier,False
2018-03-05 10:09:54,scooplorimer,sunny calm pleasant southam warks,False


Column index is reset here to ascending numerical values, and column names are retitled to be more comprehensible.

In [231]:
raw_df = raw_df.reset_index()
raw_df = raw_df.rename(index=str, columns={"index": "Date", 1: "Twitter Names", 2:"Words", 3:"Retweeted"})
raw_df.index.names = ["IndexLabel"]

In [232]:
raw_df.head(5)

Unnamed: 0_level_0,Date,Twitter Names,Words,Retweeted
IndexLabel,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,2018-03-05 11:43:14,edwardrews5,cold dry windy ardersier,False
1,2018-03-05 10:09:54,scooplorimer,sunny calm pleasant southam warks,False
2,2018-03-05 10:02:39,scooplorimer,sunny calm pleasant,False
3,2018-03-04 13:03:17,charleesaunt,misty rain cold thirsk northyorkshire,False
4,2018-03-04 11:41:24,c130medic,snowing johnstone,False


This is a function to identify any entries in the tweet body text that start with the letters "rt", as this is how retweets are formatted here. If this is the case then the letters "rt" are dropped from the string to enable uniform formatting for analysis and the value in the retweeted column is set to True to enable later distinction.
This function is then applied to our dataframe.

In [233]:
def setTrue(index, df):
    df["Retweeted"][index] = True
    
def removeRTText(index, df):
    tweet_text = df["Words"][index]
    new_tweet_text = tweet_text.lstrip("rt")
    df["Words"][index] = new_tweet_text
    
def textHasRT(index, df):
    tweet_text = df["Words"][index]
    return tweet_text.startswith("rt")
    
def process_rt_column(df):
     for index, row in df.iterrows():
        if(textHasRT(index, df)):
            setTrue(index, df)
            removeRTText(index, df)

In [234]:
process_rt_column(raw_df)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys


In [235]:
raw_df.head(20)

Unnamed: 0_level_0,Date,Twitter Names,Words,Retweeted
IndexLabel,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,2018-03-05 11:43:14,edwardrews5,cold dry windy ardersier,False
1,2018-03-05 10:09:54,scooplorimer,sunny calm pleasant southam warks,False
2,2018-03-05 10:02:39,scooplorimer,sunny calm pleasant,False
3,2018-03-04 13:03:17,charleesaunt,misty rain cold thirsk northyorkshire,False
4,2018-03-04 11:41:24,c130medic,snowing johnstone,False
5,2018-03-04 11:34:58,franceshipple1,grey soggy foggy dunstable https//tco/3dcc7a1wbo,False
6,2018-03-04 11:00:43,cranesbill1970,snowing white sky todmorden,False
7,2018-03-04 10:57:03,copsey_david,cold damp dull,False
8,2018-03-04 10:54:37,foamcushion,happy snow day stay safe have fun wrapup warm ...,False
9,2018-03-04 10:43:30,stephenstaple13,fog damp cold melton mowbray,False


In [236]:
#raw_df(lambda x: x.strip())

In [237]:
#If a word in search_category is in the first three parts of "words" string separated by
#space, keep those rows, otherwise, drop rows.
#Then, find position within string using spaces of words in search_category
#Then, put any words that were in  in search category in new dataframe, with new column
#with numerical value of position within string, with one word per row.
#Also in new dataframe will be the relevant retweet value. Then calculate weight.
#Then merge with categorised dataframe

The first week of data is contained in this csv file. Here it is converted to a dataframe, the column weight being the occurence of the word within the total amount of entries, the category being the type of weather in question.

In [238]:
categorised_df = pd.read_csv('/s3/three-word-weather/hack/categorised_words.csv', header=0, names=['word', 'weight', 'category'], na_values=' ')

In [239]:
categorised_df.head(5)

Unnamed: 0,word,weight,category
0,correctness,0.000386,misc
1,gringey,0.000386,misc
2,could,0.000386,misc
3,sparta,0.000386,misc
4,crispy,0.000386,temperature


Here search_category is defined, a list of the 500 most common words from the initial weeks dataset. This will be used as filter criteria for the rest of the data, to give us a way to clean it for analysis.
It is acknolwedged that 500 common words from a week of specific weather context may not be the most reliable search criteria, however aside from manual analysis of categories from all entries or presuming potential tweet content, this method is considered sufficient here.

In [240]:
search_category = (categorised_df['word'])
search_category.head(5)

0    correctness
1        gringey
2          could
3         sparta
4         crispy
Name: word, dtype: object

In [241]:
#maybe have to split "words" here? maybe not, try do without first
#If a word in search_category is in the first three parts of "words" string separated by
#space, keep those rows, otherwise, drop rows.

The first three words in the tweet string will be searched here to see if they contain any words from the search_category. If they do not, either this means this tweet does not contain any of the common words and we are unable to analyse the category, therefore it will be dropped from the dataframe, or that the user did not submit it in the appropriate format e.g. with location information at the beginning rather than the end, meaning it is also not appropriate for analysis and will be dropped from the dataframe.

In [242]:
def words_at_start(index, df):
    tweet_text = df["Words"][index]
    return tweet_text.split()[:3]


#def drop_rows(index, df):
#    tweet_text = df["Words"][index]
#    df.drop(index, axis=0)

In [243]:
#tweet_text.extract("^((?:\w+\s){0,2}\w+)")

In [255]:
def keep_good_tweets(df, search_category):
    for word in search_category:
        for row in df.iterrows():
            index = row[0]
            if word in words_at_start(index, df):
                return df #it's this bit that's wrong
            else:
                df.drop(index)

In [256]:
new_df = keep_good_tweets(raw_df, search_category)

In [257]:
new_df.shape

(2852, 4)

In [248]:
raw_df.shape

(2852, 4)

In [247]:
new_df

Unnamed: 0_level_0,Date,Twitter Names,Words,Retweeted
IndexLabel,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,2018-03-05 11:43:14,edwardrews5,cold dry windy ardersier,False
1,2018-03-05 10:09:54,scooplorimer,sunny calm pleasant southam warks,False
2,2018-03-05 10:02:39,scooplorimer,sunny calm pleasant,False
3,2018-03-04 13:03:17,charleesaunt,misty rain cold thirsk northyorkshire,False
4,2018-03-04 11:41:24,c130medic,snowing johnstone,False
5,2018-03-04 11:34:58,franceshipple1,grey soggy foggy dunstable https//tco/3dcc7a1wbo,False
6,2018-03-04 11:00:43,cranesbill1970,snowing white sky todmorden,False
7,2018-03-04 10:57:03,copsey_david,cold damp dull,False
8,2018-03-04 10:54:37,foamcushion,happy snow day stay safe have fun wrapup warm ...,False
9,2018-03-04 10:43:30,stephenstaple13,fog damp cold melton mowbray,False


In [16]:
def clean(raw_df, search_category):
    clean_words = dict()
    for word in search_category:
        for index, row in df.iterrows():
            if word in row[2]:
                if word in clean_words.keys():
                    clean_words[word] = clean_words[word] + 1
                else:
                    clean_words[word] = 1
    return clean_words
                
new_words = clean(raw_df, search_category)

In [17]:
def get_weighting(count):
    val = count / (raw_df.shape[0] * 3)
    return val

In [18]:
for key in new_words.keys():
    new_words[key] = get_weighting(new_words[key])

In [19]:
words, values = list(zip(*new_words.items()))
data = {'word':words, 'weight1':values}
new_df = pd.DataFrame(data, columns=['word', 'weight1'])
new_df.head(5)

Unnamed: 0,word,weight1
0,could,0.000387
1,gale,0.000678
2,on,0.06963
3,bad,0.001356
4,fecking,0.000291


In [None]:
new_df['weight1'].apply(lambda x:float(x))

In [None]:
new_df.head(5)

In [None]:
categorised_df.head(5)

In [22]:
merged_df = categorised_df.merge(new_df, on=("word"), how=("left"))
merged_df.head(5)

Unnamed: 0,word,weight,category,weight1
0,correctness,0.000386,misc,
1,gringey,0.000386,misc,
2,could,0.000386,misc,0.000387
3,sparta,0.000386,misc,
4,crispy,0.000386,temperature,


In [23]:
#Making dataframe with new and old words in separate columns, with row for new words

In [24]:
merged_df_cut = merged_df.dropna(subset = ["weight1"])

In [25]:
merged_df_cut.head(5)

Unnamed: 0,word,weight,category,weight1
2,could,0.000386,misc,0.000387
5,gale,0.000386,wind,0.000678
7,on,0.000386,misc,0.06963
8,bad,0.000386,misc,0.001356
10,fecking,0.000386,swearing,0.000291


In [26]:
merged_df_cut.shape

(309, 4)

In [42]:
merged_df_cut["category"] = merged_df_cut["category"].apply(lambda x: x.lower())
merged_df_cut["category"].replace(to_replace="sky_state", value="sky state", inplace=True)

In [43]:
#attempting to exclude outliers from dataframe and delete row if excluded...come back to later...

In [44]:
merged_df_cut["category"]

2               misc
5               wind
7               misc
8               misc
10          swearing
11         sky state
16              misc
17              misc
18         sky state
19              misc
21     precipitation
22              misc
23              misc
24              misc
28              misc
31         sky state
34              misc
36              misc
37              misc
40              misc
41              misc
42              wind
44     precipitation
46     precipitation
53     precipitation
54              misc
56              misc
61          swearing
62              misc
66              misc
           ...      
463             misc
464      temperature
465             misc
466             misc
467        sky state
468        sky state
470             misc
471        sky state
472             misc
473        sky state
474      temperature
475    precipitation
476    precipitation
477      temperature
478    precipitation
479             misc
482        sk

In [45]:
x1 = merged_df_cut["word"]
x2 = merged_df_cut["category"]
y1 = merged_df_cut["weight"]
y2 = merged_df_cut["weight1"]

In [1]:
y1t = y1[~((y1-y1.mean()).abs()>10*y1.std())]
y2t = y2[~((y2-y2.mean()).abs()>10*y2.std())]

NameError: name 'y1' is not defined

In [30]:
merged_df_cut.head(5)

Unnamed: 0,word,weight,category,weight1
2,could,0.000386,misc,0.000387
5,gale,0.000386,wind,0.000678
7,on,0.000386,misc,0.06963
8,bad,0.000386,misc,0.001356
10,fecking,0.000386,swearing,0.000291


In [86]:
#merged_df_cut["category"] = (merged_df_cut["category"]).lower

AttributeError: 'Series' object has no attribute 'lower'

In [60]:
#understand psychological response to weather-deduce by relatedness
#within subject between different categories-manually categorise for impact statements,
#sort through phrases or noun strings in order to analyse?
#Would be much bigger project...
#Milestone 1-compare word occurence between initial and post, in individual words and categories.
#Milestone 2? manually assess impactfulness (1-10), analyse this within each category and between,
#organise data with location

In [71]:
from numpy.random import random

from bokeh.plotting import figure, show, output_notebook

def mscatter(p, x, y, marker):
    p.scatter(x, y, marker=marker, size=15,
              line_color="navy", fill_color="orange", alpha=0.5)

p = figure(title="Word usage: Initial vs. Post", toolbar_location=None)
p.grid.grid_line_color = None
p.background_fill_color = "#eeeeee"

N = 10

mscatter(p, [float(value) for value in y1t.values], [float(value) for value in y2t.values], "circle")

output_notebook()

show(p)  # make points different colours based on category of word



In [92]:
colormap = {'misc': 'red', 'Misc': 'red', 'sky state': 'green', 'sky_state': 'green', 'swearing': 'blue', "wind": "yellow", "temperature": "orange", "precipitation": "purple"}
colors = [colormap[x] for x in merged_df_cut['category']]

p = figure(title = "Word Occurence: Initial vs. Post")
p.xaxis.axis_label = 'Initial Weight(normalised to decimal)'
p.yaxis.axis_label = 'Post Weight(normalised to decimal)'

p.circle(y1, y2,
         color=colors, fill_alpha=0.2, legend="category", size=10)

output_notebook()

show(p)
#sort out legend



In [36]:
#from bokeh.plotting import figure, show, output_notebook

#colormap = {'misc': 'red', 'wind': 'green', 'sky state': 'blue', 'precipitation': 'yellow',
            #'swearing': 'purple', 'temperature': 'orange'}
#colors = [colormap[x] for x in merged_df_cut['category']]

#p = figure(title = "The usa")
#p.xaxis.axis_label = 'Word'
#p.yaxis.axis_label = 'Weight'

#p.circle(merged_df_cut["y1t"], merged_df_cut["y2t"],
         #color=colors, fill_alpha=0.2, size=10)

#output_notebook()

#show(p)

In [65]:
#[float(value) for value in y2.values]