## Altair Exercises

This notebook explores several visualizations in Altair.

______

### Part 4

The following exercise is based on the article by FiveThirtyEight [The Mayweather-McGregor Fight, As Told Through Emojis](https://fivethirtyeight.com/features/the-mayweather-mcgregor-fight-as-told-through-emojis/). 

It uses the [tweets](data/tweets.csv) dataset, which FiveThirtyEight created with the Twitter Streaming API. The dataset contains a sample of all the tweets that matched the search terms #MayMac, #MayweatherMcGregor, #MayweatherVMcGregor, #MayweatherVsMcGregor, #McGregor and #Mayweather, collected between 12:05 a.m. and 1:15 a.m. EDT. The sample includes 12,118 tweets that had emojis. The data is available [on GitHub](https://github.com/fivethirtyeight/data/tree/master/mayweather-mcgregor).

In [1]:
import pandas as pd
import numpy as np
import altair as alt

In [2]:
# enable correct rendering
alt.renderers.enable('default')

# uses intermediate json files to speed things up
alt.data_transformers.enable('json')

DataTransformerRegistry.enable('json')

In [3]:
# load the tweets
tweets = pd.read_csv('../assets/tweets.csv')

# we're going to process the data in a couple of ways
# first, we want to know how many emojis are in each tweet so we'll create a new column
# that counts them
tweets['emojis'] = tweets['text'].str.findall(r'[^\w\s.,"@\'?/#!$%\^&\*;:{}=\-_`~()\U0001F1E6-\U0001F1FF]').str.len()

# next, there are a few specific emojis that we care about, we're going to create
# a column for each one and indicate how many times it showed up in the tweet
boxer_emojis = ['☘️','🇮🇪','🍀','💸','🤑','💰','💵','😴','😂','🤣','🥊','👊','👏','🇮🇪','💪','🔥','😭','💰']
for emoji in boxer_emojis:
    # here's a different way to get the counts
    tweets[emoji] = tweets.text.str.count(emoji)
    
# For Irish pride vs. the Money Team we want the number
# of ☘️, 🇮🇪 or 🍀 and of 💸, 🤑, 💰 or 💵 in each tweet
tweets['irish_pride'] = tweets['☘️'] + tweets['🇮🇪'] + tweets['🍀']
tweets['money_team'] = tweets['💸'] + tweets['🤑'] + tweets['💰'] +  tweets['💵']
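
The cell above counts emoji in two ways: `str.findall(...).str.len()` for any non-text character, and `str.count(...)` for one specific emoji. The sketch below contrasts the two on a made-up Series. The `toy` data and the simplified pattern `[^\w\s#]` are assumptions for illustration, not the notebook's actual regex.

```python
import pandas as pd

# Toy tweets (made up for illustration, not taken from the dataset)
toy = pd.Series([
    "what a fight \U0001F94A\U0001F94A #MayMac",  # two boxing gloves
    "no emoji here",
    "strong \U0001F4AA\U0001F3FF",                # flexed biceps + skin-tone modifier
])

# Approach 1: list every matching character, then take the list length.
# This simplified pattern keeps anything that is not a word character,
# whitespace, or '#'.
non_text = toy.str.findall(r"[^\w\s#]").str.len()

# Approach 2: count the matches of one specific emoji directly.
gloves = toy.str.count("\U0001F94A")

print(non_text.tolist())  # [2, 0, 2]
print(gloves.tolist())    # [2, 0, 0]
```

Note that the character-level approach counts a skin-tone modifier as its own character, so a single modified emoji contributes 2 to the total.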

In [4]:
tweets.head()

| | created_at | emojis | id | link | retweeted | screen_name | text | ☘️ | 🇮🇪 | 🍀 | ... | 😂 | 🤣 | 🥊 | 👊 | 👏 | 💪 | 🔥 | 😭 | irish_pride | money_team |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-08-27 00:05:34 | 1 | 901656910939770881 | https://twitter.com/statuses/901656910939770881 | False | aaLiysr | Ringe çıkmadan ateş etmeye başladı 😃#McGregor ... | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2017-08-27 00:05:35 | 5 | 901656917281574912 | https://twitter.com/statuses/901656917281574912 | False | zulmafrancozaf | 😲😲😲😲😲 @lalylourbet2 https://t.co/ERUGHhQINE | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2017-08-27 00:05:35 | 2 | 901656917105369088 | https://twitter.com/statuses/901656917105369088 | False | Adriana11D | 🇮🇪🇮🇪🇮🇪 💪💪#MayweathervMcgregor | 0 | 3 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 3 | 0 |
| 3 | 2017-08-27 00:05:35 | 2 | 901656917747142657 | https://twitter.com/statuses/901656917747142657 | False | Nathan_Caro_ | Cest partit #MayweatherMcGregor 💪🏿 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 2017-08-27 00:05:35 | 2 | 901656916828594177 | https://twitter.com/statuses/901656916828594177 | False | sahouraxox | Low key feeling bad for ppl who payed to watch... | 0 | 0 | 0 | ... | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


## The Mayweather-McGregor Fight, As Told Through Emojis

### <font color=grey>We laughed, cried and cried some more. </font>

_Original article available at [FiveThirtyEight](https://fivethirtyeight.com/features/the-mayweather-mcgregor-fight-as-told-through-emojis/)_

By [Dhrumil Mehta](https://fivethirtyeight.com/contributors/dhrumil-mehta/), [Oliver Roeder](https://fivethirtyeight.com/contributors/oliver-roeder/) and [Rachael Dottle](https://fivethirtyeight.com/contributors/rachael-dottle/)

Filed under [Mayweather vs. McGregor](https://fivethirtyeight.com/tag/mayweather-vs-mcgregor/)

Get the data on [GitHub](https://github.com/fivethirtyeight/data/tree/master/mayweather-mcgregor)

For the nearly 15,000 people in Las Vegas’s T-Mobile Arena on Saturday night, and the millions more huddled around TVs across the world, the Floyd Mayweather–Conor McGregor fight was a roller coaster of emotions. They were anxious as pay-per-view [technical problems](http://www.espn.com/boxing/story/_/id/20469815/floyd-mayweather-conor-mcgregor-delay-ppv-problems) pushed back the fight’s start. They were full of anticipation when the combatants finally emerged after months of hype. They were surprised when McGregor held his own, or seemed to hold his own, for a couple of rounds. They were thrilled when Mayweather finally started fighting. And they were exhausted by the end.

How do we know all this? Emojis.

We were monitoring Twitter on fight night, pulling tweets that contained fight-related hashtags (those that included #MayweatherVsMcgregor, for example). In the end, we collected about 200,000 fight-related tweets, of which more than 12,000 contained emojis. (To be clear, that’s a small enough sample that this emojinalysis might not make it through peer review.)<sup>1</sup>

In [5]:
# dictionary that will map emoji to percentage
percentages = {}  

# find the total number of emojis
total = tweets['emojis'].sum() 

# for each emoji, figure out how prevalent it is
emojis = ['😂','🤣','🥊','👊','👏','🇮🇪','💪','🔥','😭','💰']
for emoji in emojis:  
    percentages[emoji] = [round(tweets[emoji].sum() / total * 100,1)]
    
# create a data frame to hold this from the dictionary
percentages_df = pd.DataFrame.from_dict(percentages).T

# sort the data frame from most to least used
percentages_df = percentages_df.sort_values(by=[0], ascending = False).reset_index()

# rename the columns
percentages_df = percentages_df.rename(columns={'index':'EMOJI', 0: 'PERCENT'})

# create a rank column based on position in the ordered list
percentages_df['rank'] = range(1, len(percentages_df) + 1)

# modify the text
percentages_df['PERCENT_TEXT'] = percentages_df['PERCENT'].astype('str') + ' %'

In [6]:
percentages_df

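| | EMOJI | PERCENT | rank | PERCENT_TEXT |
|---|---|---|---|---|
| 0 | 😂 | 23.1 | 1 | 23.1 % |
| 1 | 🥊 | 5.7 | 2 | 5.7 % |
| 2 | 👊 | 3.5 | 3 | 3.5 % |
| 3 | 👏 | 3.0 | 4 | 3.0 % |
| 4 | 💪 | 2.5 | 5 | 2.5 % |
| 5 | 🇮🇪 | 2.4 | 6 | 2.4 % |
| 6 | 🤣 | 2.3 | 7 | 2.3 % |
| 7 | 🔥 | 2.3 | 8 | 2.3 % |
| 8 | 😭 | 2.0 | 9 | 2.0 % |
| 9 | 💰 | 1.8 | 10 | 1.8 % |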
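
The share calculation above divides each emoji's count by the total emoji count and then ranks the results. As a minimal sketch of the same idea without the intermediate dictionary, the example below works directly on column sums. The `counts` frame and its column names are made up for illustration.

```python
import pandas as pd

# Toy per-tweet counts (hypothetical): two tracked symbols plus the
# total number of emojis found in each tweet
counts = pd.DataFrame({
    "a": [2, 0, 1],
    "b": [0, 1, 0],
    "all_emojis": [4, 1, 5],
})

total = counts["all_emojis"].sum()  # 10 emojis overall

# Share of each tracked symbol among all emojis, in percent
share = (counts[["a", "b"]].sum() / total * 100).round(1)

# Sort, turn the index into an EMOJI column, and add a 1-based rank
share_df = (
    share.sort_values(ascending=False)
         .rename_axis("EMOJI")
         .reset_index(name="PERCENT")
)
share_df["rank"] = range(1, len(share_df) + 1)

print(share_df["EMOJI"].tolist())    # ['a', 'b']
print(share_df["PERCENT"].tolist())  # [30.0, 10.0]
```

Deriving the rank from `len(share_df)` keeps the code correct if the list of tracked emojis grows or shrinks.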


In [7]:
# use percentages_df to recreate the visualization above

# emojis ordered from most to least used; this fixes the bar order
sort_emoji = list(percentages_df.sort_values(by=['PERCENT'],ascending=False)['EMOJI'])

# one bar per emoji: PERCENT_TEXT is not unique (two emojis share "2.3 %"),
# so we use EMOJI on the y channel to avoid collapsing two bars into one
bars = alt.Chart(percentages_df).mark_bar().encode(
    x=alt.X(
        'PERCENT:Q',
        axis=None),
    y=alt.Y(
        'EMOJI:N',
        axis=None,
        sort=sort_emoji)
).properties(
    height=198
)

ranked_emoji = alt.Chart(percentages_df).mark_text(
).encode(
    y=alt.Y(
        'row_number:O',
        axis=None),
).transform_window(
    row_number='row_number()'
).transform_window(
    rank='rank(row_number)'
)

ranked_emoji

emojis = ranked_emoji.encode(text='EMOJI:N')
emoji_pcts = ranked_emoji.encode(text='PERCENT_TEXT:O')
tables = alt.hconcat(emojis, emoji_pcts)

emoji_dist = (tables | bars).configure_mark(
    color='#F9A602'
).configure_view(
    strokeWidth=0
)

emoji_dist

There were the likely frontrunners for most-used emoji: the 🥊, the 👊, the 💪. But the emoji of the fight was far and away the 😂. (“Face with tears of joy.”)<sup>2</sup>
    
> <font color=grey> 2. That’s certainly appropriate for this spectacle, but it should be noted that 😂 is also the [most tweeted](http://emojitracker.com/) emoji generally.</font>

Here’s how the night unfolded, emoji-wise. (All of the charts below show them on a four-minute rolling average.)

For one thing, the fight was a sharply partisan affair. The majority of people in the arena appeared to be McGregor fans — he hails from Dublin and an Irish flag, worn cape-style, almost seemed like the evening’s dress code. But other fans were members of TMT — The Money Team — and loyal to “Money” Mayweather. Twitter’s loyalties came and went as the match progressed, with enthusiasm from either camp seemingly matching each fighter’s success.

In [8]:
# Pre-Processing

# We're going to want to work with time objects so we need to make a datetime
# column (basically transforming the text in "created at"). It duplicates
# the data but it will make things easier
tweets['datetime'] = pd.to_datetime(tweets['created_at'])
tweets = tweets.set_index('datetime')

# next we're going to create a rolling average
# first for the money team
mdf = (tweets['money_team'].rolling('4Min').mean().groupby(pd.Grouper(freq='15S')).mean() * 15).reset_index()
mdf['team'] = '💸🤑💰💵'
mdf = mdf.rename(columns={'money_team':'tweet_count'})

# next for the irish team
idf = (tweets['irish_pride'].rolling('4Min').mean().groupby(pd.Grouper(freq='15S')).mean() * 15).reset_index()
idf['team'] = '☘️🍀🇮🇪'
idf = idf.rename(columns={'irish_pride':'tweet_count'})

# now we'll combine our datasets
ndf = pd.concat([mdf,idf])

In [9]:
ndf.sample(5)

| | datetime | tweet_count | team |
|---|---|---|---|
| 164 | 2017-08-27 00:46:30 | 1.418833 | 💸🤑💰💵 |
| 11 | 2017-08-27 00:08:15 | 0.82778 | 💸🤑💰💵 |
| 45 | 2017-08-27 00:16:45 | 2.182686 | ☘️🍀🇮🇪 |
| 6 | 2017-08-27 00:07:00 | 0.972265 | 💸🤑💰💵 |
| 132 | 2017-08-27 00:38:30 | 0.995267 | 💸🤑💰💵 |
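
The pre-processing above chains a time-based rolling mean with a fixed-frequency grouping. The sketch below shows what each step does on a tiny made-up series. The timestamps, values, and the 2-minute bin width are assumptions chosen so the result is easy to check by hand, and the notebook's extra scaling by 15 is left out.

```python
import pandas as pd

# Toy series indexed by timestamp (hypothetical values)
idx = pd.to_datetime([
    "2017-08-27 00:00:00",
    "2017-08-27 00:01:00",
    "2017-08-27 00:05:00",
])
s = pd.Series([1, 3, 5], index=idx)

# Time-based window: each row averages the values from the
# preceding four minutes, including the row itself
rolled = s.rolling("4min").mean()

# Regular bins: average the rolled values inside each 2-minute bin;
# bins without any rows come out as NaN
binned = rolled.groupby(pd.Grouper(freq="2min")).mean()

print(rolled.tolist())  # [1.0, 2.0, 5.0]
print(len(binned))      # 3
```

The last rolled value is 5.0 rather than an average with 3, because the 00:01 row falls exactly four minutes earlier and the window excludes its left edge by default.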


In [10]:
# we're also going to create an annotations data frame
annotations = [['2017-08-27 00:10:00',4, 'Fight begins'],
               ['2017-08-27 00:22:00',5, 'McGregor does OK \nin the early rounds'],
               ['2017-08-27 00:53:00',4, "Mayweather takes \nover and wins by \nTKO"]]
a_df = pd.DataFrame(annotations, columns=['date','count','note'])

# we're also going to create an annotation line
a_line = pd.DataFrame({
    'x': ['2017-08-27 00:15:00', '2017-08-27 00:15:00', '2017-08-27 00:30:00', '2017-08-27 00:30:00'],
    'y': [3.5, 2, 4.2, 3.9],
    'class': ['A', 'A', 'B', 'B']
})
a_line['x'] = pd.to_datetime(a_line['x'])

In [11]:
fight_lines = alt.Chart(ndf).mark_line().encode(
    x=alt.X('datetime:T',title=None,axis=alt.Axis(tickCount=5)), #bin=alt.Bin(step=100)
    y=alt.Y('tweet_count:Q',title='Four-minute rolling average'),
    color=alt.Color('team', legend=alt.Legend(orient="top",title=None,labelFontSize=24),scale=alt.Scale(
            domain=['☘️🍀🇮🇪', '💸🤑💰💵'],
            range=['#4BAB4E', '#FCCB28'])),
)

annotations = alt.Chart(a_df).encode(
    x=alt.X('date:T',axis=None),
    y=alt.Y('count:Q',axis=None),
    text=alt.Text('note:O'),
).mark_text(lineBreak='\n',align='left',fontSize=16) #.properties(width=15, height=240)

fight_ano_line = alt.Chart(a_line).mark_line(color='black').encode(
    x='x',
    y='y',
    detail='class'
)


# (fight_lines + annotations + fight_ano_line ).configure(background='#F0F0F0').configure_view(
fight_lines.configure(background='#F0F0F0').configure_view(
    # we don't want a stroke around the chart view
    strokeWidth=0
).properties(
    title = alt.TitleParams(text = "Irish Pride VS The Money Machine", 
                            subtitle = ["Four-minute rolling average of the number of uses of selected emoji in",
                                        "sampled tweets during the Mayweather-McGregor fight"],
                            font = 'Helvetica Neue', 
                            fontSize = 26, 
                            color = '#3C3C3C',
                            subtitleColor='#3C3C3C', 
                            subtitleFontSize = 18,
                            anchor='start',
                            offset=20,
                            ),
    # set the dimensions of the visualization
    width=600,
    height=300
).configure_axis(labelColor='#cccccc',domainColor='#cccccc',tickColor='#cccccc')

To the surprise of many (of the neutral and pro-Mayweather viewers, anyway) McGregor won the first round. The next couple were washes, and a quarter of the way into the [scheduled 12 rounds](https://www.nytimes.com/2017/08/26/sports/mayweather-mcgregor.html) … the Irish underdog may have been winning! The Irish flags and shamrocks followed on Twitter. Things slowly (perhaps even 😴ly) turned around as one of the best pound-for-pound boxers in history took control of the man making his pro debut — an outcome which was predicted by precisely everyone. Out came the emoji money bags.

By the sixth round, it seemed like only a matter of time until the old pro dismantled the newcomer. By the ninth it was clear Mayweather was going for the knockout. It came soon thereafter. Mayweather unleashed a vicious flurry of punches in the 10th and the ref stepped in, declaring Mayweather the victor and saving McGregor, who was somehow still on his feet, from further damage.

In [12]:
# We're going to want to work with time objects so we need to make a datetime
# column (basically transforming the text in "created at"). It duplicates
# the data but it will make things easier
tweets['datetime'] = pd.to_datetime(tweets['created_at'])
tweets = tweets.set_index('datetime')

# next we're going to create a rolling average
# first for fire
firedf = (tweets['🔥'].rolling('4Min').mean().groupby(pd.Grouper(freq='15S')).mean() * 15).reset_index()
firedf['sentiment'] = '🔥'
firedf = firedf.rename(columns={'🔥':'tweet_count'})

# next for the snooze
snoozedf = (tweets['😴'].rolling('4Min').mean().groupby(pd.Grouper(freq='15S')).mean() * 15).reset_index()
snoozedf['sentiment'] = '😴'
snoozedf = snoozedf.rename(columns={'😴':'tweet_count'})

# now we'll combine our datasets
hb = pd.concat([firedf,snoozedf])
hb.sample(5)

| | datetime | tweet_count | sentiment |
|---|---|---|---|
| 208 | 2017-08-27 00:57:30 | 0.18783 | 😴 |
| 38 | 2017-08-27 00:15:00 | 0.258195 | 😴 |
| 71 | 2017-08-27 00:23:15 | 0.220278 | 😴 |
| 199 | 2017-08-27 00:55:15 | 1.149673 | 🔥 |
| 195 | 2017-08-27 00:54:15 | 1.051839 | 🔥 |
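
The cells above build the long-format frame by labelling one frame per emoji and concatenating them. A possible alternative, sketched below on made-up data, is to keep the series side by side and reshape once with `melt`. The `wide` frame and the column names `fire` and `snooze` are assumptions for illustration.

```python
import pandas as pd

# Toy wide frame: one column per emoji series (hypothetical values)
wide = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2017-08-27 00:10:00",
        "2017-08-27 00:10:15",
    ]),
    "fire": [1.0, 1.5],
    "snooze": [0.2, 0.3],
})

# Reshape to one row per (datetime, sentiment) pair, which is the
# layout Altair expects for a color-encoded line chart
long = wide.melt(
    id_vars="datetime",
    var_name="sentiment",
    value_name="tweet_count",
)

print(long["sentiment"].tolist())    # ['fire', 'fire', 'snooze', 'snooze']
print(long["tweet_count"].tolist())  # [1.0, 1.5, 0.2, 0.3]
```

Either route gives the same long layout; `melt` mainly saves the repeated rename-and-label steps when there are many series.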


In [13]:
range_ = ['red', '#66CCCC']
annot = pd.DataFrame({'x' : ['2017-08-27 00:15:00', '2017-08-27 00:15:00', '2017-08-27 00:30:30', '2017-08-27 00:30:30', '2017-08-27 00:45:00', '2017-08-27 00:45:00'],
                      'y' : [0.4, 0.9, 0.75, 1.25, 0.9, 1.25], 
                      'class' : ['A', 'A', 'B', 'B', 'C', 'C']
                     })
annot['x'] = pd.to_datetime(annot['x'])

# we're also going to create an annotations data frame to help you
annotations = [['2017-08-27 00:10:00',1, 'Fight begins'],
               ['2017-08-27 00:33:00',3.5, 'Mayweather \ntakes control in \nmiddle rounds']]
sf_a_df = pd.DataFrame(annotations, columns=['date','count','note'])


lines = alt.Chart(hb).mark_line(
).encode(
    x=alt.X(
        'datetime:T',
        axis=alt.Axis(
            title='',
            tickMinStep=15,
            tickCount=4)
    ),
    y=alt.Y(
        'tweet_count:Q', 
        axis=alt.Axis(
            title='Four-minute rolling average',
            tickMinStep=0.5,
            tickCount=5)
    ),
    color=alt.Color(
        'sentiment:N', 
        scale=alt.Scale(
            range=range_
        ))
)

# annotations = alt.Chart(annot).mark_line().encode(
#     x='x',
#     y='y',
#     detail='class'
# )

# hb_annotations = alt.Chart(sf_a_df).encode(
#     x=alt.X('date:T',axis=None),
#     y=alt.Y('count:Q',axis=None),
#     text=alt.Text('note:O'),
# ).mark_text(lineBreak='\n',align='left',fontSize=16)

final3 = lines

team = final3.configure(
    background='#eeeeee'
).configure_axis(
    labelFontSize=10,
    labelOpacity=0.4,
    tickOpacity=0.2,
    domainOpacity=0.2,
    titleFontSize=14
).configure_title(
    anchor='start',
    font='Helvetica',
    fontSize=24,
    fontWeight='bold',
    dy=-20
).configure_legend(
    labelFontSize=30,
    orient='top',
    title=None,
    titlePadding=10
).properties(
    title=alt.TitleParams(text = "Much hype, some boredom", 
                          subtitle = ["Four-minute rolling average of the number of uses of selected emoji in",
                                      "sampled tweets during the Mayweather-McGregor fight"],
                          font = 'Helvetica Neue', 
                          fontSize = 26, 
                          color = '#3C3C3C',
                          subtitleColor='#3C3C3C', 
                          subtitleFontSize = 18,
                          anchor='start',
                          offset=20,
                         ), 
    width=600,
    height=300
)

team

#### Yet Another Visualization

In [14]:
# We're going to want to work with time objects so we need to make a datetime
# column (basically transforming the text in "created at"). It duplicates
# the data but it will make things easier
tweets['datetime'] = pd.to_datetime(tweets['created_at'])
tweets = tweets.set_index('datetime')

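# keep only the two emoji columns we need; summing the text columns
# (link, screen_name, text, ...) is slow and not meaningful
tears = tweets[['😭', '🤣']].resample('1s').sum()
tears = tears[(tears['😭']>0) | (tears['🤣']>0)]

# We're going to create a rolling average
cry = tears['😭'].rolling('4Min').mean().reset_index()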
cry['emoji'] = '😭'
cry = cry.rename(columns={'😭':'tweet_count'})

laugh = tears['🤣'].rolling('4Min').mean().reset_index()
laugh['emoji'] = '🤣'
laugh = laugh.rename(columns={'🤣':'tweet_count'})

# now we'll combine our datasets
cl_df = pd.concat([cry,laugh])
cl_df.head()

| | datetime | tweet_count | emoji |
|---|---|---|---|
| 0 | 2017-08-27 00:05:35 | 0.0 | 😭 |
| 1 | 2017-08-27 00:05:59 | 0.0 | 😭 |
| 2 | 2017-08-27 00:06:12 | 0.666667 | 😭 |
| 3 | 2017-08-27 00:06:17 | 0.75 | 😭 |
| 4 | 2017-08-27 00:06:27 | 0.6 | 😭 |
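
This version first sums the counts per second and then keeps only the seconds in which at least one of the two emojis appeared. The sketch below walks through those two steps on made-up data. The column names `cry` and `laugh` and all values are assumptions for illustration.

```python
import pandas as pd

# Toy per-tweet counts with sub-second timestamps (hypothetical values)
idx = pd.to_datetime([
    "2017-08-27 00:00:00.2",
    "2017-08-27 00:00:00.7",
    "2017-08-27 00:00:02.1",
])
df = pd.DataFrame({"cry": [1, 0, 0], "laugh": [0, 2, 0]}, index=idx)

# Step 1: add up the counts inside each one-second bin; seconds
# without any tweets appear as rows of zeros
per_second = df.resample("1s").sum()

# Step 2: keep only the seconds where at least one emoji was used
active = per_second[(per_second["cry"] > 0) | (per_second["laugh"] > 0)]

print(len(per_second))                # 3
print(per_second.iloc[0].tolist())    # [1, 2]
print(len(active))                    # 1
```

Because the empty seconds are dropped before the rolling mean, the average is taken over active seconds only, which is why these values differ in scale from the earlier charts.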


In [15]:
# we're also going to create an annotations data frame to help you
annotations = [['2017-08-27 00:10:00',2.25, 'Fight begins'],
               ['2017-08-27 00:30:00',2.25, 'McGregor \nimpresses \nearly'],
               ['2017-08-27 00:50:00',.25, 'Fight ends']]
cl_a_df = pd.DataFrame(annotations, columns=['date','count','note'])

# we're also going to create an annotation line
a3_line = pd.DataFrame({
    'x': ['2017-08-27 00:15:00', '2017-08-27 00:15:00', '2017-08-27 00:29:00', '2017-08-27 00:24:00','2017-08-27 00:55:00','2017-08-27 00:55:00'],
    'y': [2.1, 1.5, 2, 1.6,.4,.9],
    'class': ['A', 'A', 'B', 'B','C','C']
})
a3_line['x'] = pd.to_datetime(a3_line['x'])

cl_lines = alt.Chart(cl_df).mark_line().encode(
    x=alt.X('datetime:T',title=None,axis=alt.Axis(tickCount=5)), #bin=alt.Bin(step=100)
    y=alt.Y('tweet_count:Q',title='Four-minute rolling average'),
    color=alt.Color('emoji', legend=alt.Legend(orient="top",title=None,labelFontSize=24),scale=alt.Scale(
            domain=['😭', '🤣'],
            range=['#3EC1C7', '#F6801D'])),
)

# cl_annotations = alt.Chart(cl_a_df).encode(
#     x=alt.X('date:T',axis=None),
#     y=alt.Y('count:Q',axis=None),
#     text=alt.Text('note:O'),
# ).mark_text(lineBreak='\n',align='left',fontSize=16) #.properties(width=15, height=240)

# ano3_line = alt.Chart(a3_line).mark_line(color='black').encode(
#     x='x',
#     y='y',
#     detail='class'
# )

# (cl_lines + cl_annotations + ano3_line ).configure(background='#F0F0F0').configure_view(
cl_lines.configure(background='#F0F0F0').configure_view(
    # we don't want a stroke around the chart view
    strokeWidth=0
).properties(
    title = alt.TitleParams(text = "Tears were shed – of joy and sorrow", 
                            subtitle = ["Four-minute rolling average of the number of uses of selected emoji in",
                                        "sampled tweets during the Mayweather-McGregor fight"],
                            font = 'Helvetica Neue', 
                            fontSize = 26, 
                            color = '#3C3C3C',
                            subtitleColor='#3C3C3C', 
                            subtitleFontSize = 18,
                            anchor='start',
                            offset=20,
                            ),
    # set the dimensions of the visualization
    width=600,
    height=300
).configure_axis(labelColor='#cccccc',domainColor='#cccccc',tickColor='#cccccc')

______________________
<div style="text-align: right"><sub>Exercise adapted and modified from UMSI homework assignment for SIADS 522.</sub></div>