# Analyse - Predict

Functions reduce code duplication and let the user get an output for varying inputs. All the functions you will write use Eskom data and variables.

## Instructions to Students
- **Do not add or remove cells in this notebook. Do not edit or remove the `### START FUNCTION` or `### END FUNCTION` comments. Do not add any code outside of the functions you are required to edit. Doing any of this will lead to a mark of 0%!**
- Answer the questions according to the specifications provided.
- Use the given cell in each question to see if your function matches the expected outputs.
- Do not hard-code answers to the questions.
- The use of Stack Overflow, Google, and other online tools is permitted. However, copying a fellow student's code is not permissible and is considered a breach of the Honour code. Doing this will result in a mark of 0%.
- Good luck, and may the force be with you!

## Imports

In [17]:
import pandas as pd
import numpy as np

## Data Loading and Preprocessing

### Electrification by province (EBP) data

In [18]:
ebp_url = 'https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/electrification_by_province.csv'
ebp_df = pd.read_csv(ebp_url)

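# Strip the thousands separators from every province column and convert to int
# (Series/DataFrame.iteritems() was removed in pandas 2.0, so iterate over the column names instead)
for col in ebp_df.columns[1:]:
    ebp_df[col] = ebp_df[col].str.replace(',', '').astype(int)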

ebp_df.head()

Unnamed: 0,Financial Year (1 April - 30 March),Limpopo,Mpumalanga,North west,Free State,Kwazulu Natal,Eastern Cape,Western Cape,Northern Cape,Gauteng
0,2000/1,51860,28365,48429,21293,63413,49008,48429,6168,39660
1,2001/2,68121,26303,38685,20928,64123,45773,38685,10359,36024
2,2002/3,49881,11976,28532,10316,63078,55748,28532,6869,32127
3,2003/4,42034,33515,34027,16135,60282,47414,34027,10976,39488
4,2004/5,54646,16218,21450,5668,37811,42041,21450,6316,18422


### Twitter data

In [19]:
twitter_url = 'https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/twitter_nov_2019.csv'
twitter_df = pd.read_csv(twitter_url)
twitter_df.head()

Unnamed: 0,Tweets,Date
0,@BongaDlulane Please send an email to mediades...,2019-11-29 12:50:54
1,@saucy_mamiie Pls log a call on 0860037566,2019-11-29 12:46:53
2,@BongaDlulane Query escalated to media desk.,2019-11-29 12:46:10
3,"Before leaving the office this afternoon, head...",2019-11-29 12:33:36
4,#ESKOMFREESTATE #MEDIASTATEMENT : ESKOM SUSPEN...,2019-11-29 12:17:43


## Important Variables (Do not edit these!)

In [20]:
# gauteng ebp data as a list
gauteng = ebp_df['Gauteng'].astype(float).to_list()

# dates for twitter tweets
dates = twitter_df['Date'].to_list()

# dictionary mapping official municipality twitter handles to the municipality name
mun_dict = {
    '@CityofCTAlerts' : 'Cape Town',
    '@CityPowerJhb' : 'Johannesburg',
    '@eThekwiniM' : 'eThekwini' ,
    '@EMMInfo' : 'Ekurhuleni',
    '@centlecutility' : 'Mangaung',
    '@NMBmunicipality' : 'Nelson Mandela Bay',
    '@CityTshwane' : 'Tshwane'
}

# dictionary of english stopwords
stop_words_dict = {
    'stopwords':[
        'where', 'done', 'if', 'before', 'll', 'very', 'keep', 'something', 'nothing', 'thereupon', 
        'may', 'why', '’s', 'therefore', 'you', 'with', 'towards', 'make', 'really', 'few', 'former', 
        'during', 'mine', 'do', 'would', 'of', 'off', 'six', 'yourself', 'becoming', 'through', 
        'seeming', 'hence', 'us', 'anywhere', 'regarding', 'whole', 'down', 'seem', 'whereas', 'to', 
        'their', 'various', 'thereafter', '‘d', 'above', 'put', 'sometime', 'moreover', 'whoever', 'although', 
        'at', 'four', 'each', 'among', 'whatever', 'any', 'anyhow', 'herein', 'become', 'last', 'between', 'still', 
        'was', 'almost', 'twelve', 'used', 'who', 'go', 'not', 'enough', 'well', '’ve', 'might', 'see', 'whose', 
        'everywhere', 'yourselves', 'across', 'myself', 'further', 'did', 'then', 'is', 'except', 'up', 'take', 
        'became', 'however', 'many', 'thence', 'onto', '‘m', 'my', 'own', 'must', 'wherein', 'elsewhere', 'behind', 
        'becomes', 'alone', 'due', 'being', 'neither', 'a', 'over', 'beside', 'fifteen', 'meanwhile', 'upon', 'next', 
        'forty', 'what', 'less', 'and', 'please', 'toward', 'about', 'below', 'hereafter', 'whether', 'yet', 'nor', 
        'against', 'whereupon', 'top', 'first', 'three', 'show', 'per', 'five', 'two', 'ourselves', 'whenever', 
        'get', 'thereby', 'noone', 'had', 'now', 'everyone', 'everything', 'nowhere', 'ca', 'though', 'least', 
        'so', 'both', 'otherwise', 'whereby', 'unless', 'somewhere', 'give', 'formerly', '’d', 'under', 
        'while', 'empty', 'doing', 'besides', 'thus', 'this', 'anyone', 'its', 'after', 'bottom', 'call', 
        'n’t', 'name', 'even', 'eleven', 'by', 'from', 'when', 'or', 'anyway', 'how', 'the', 'all', 
        'much', 'another', 'since', 'hundred', 'serious', '‘ve', 'ever', 'out', 'full', 'themselves', 
        'been', 'in', "'d", 'wherever', 'part', 'someone', 'therein', 'can', 'seemed', 'hereby', 'others', 
        "'s", "'re", 'most', 'one', "n't", 'into', 'some', 'will', 'these', 'twenty', 'here', 'as', 'nobody', 
        'also', 'along', 'than', 'anything', 'he', 'there', 'does', 'we', '’ll', 'latterly', 'are', 'ten', 
        'hers', 'should', 'they', '‘s', 'either', 'am', 'be', 'perhaps', '’re', 'only', 'namely', 'sixty', 
        'made', "'m", 'always', 'those', 'have', 'again', 'her', 'once', 'ours', 'herself', 'else', 'has', 'nine', 
        'more', 'sometimes', 'your', 'yours', 'that', 'around', 'his', 'indeed', 'mostly', 'cannot', '‘ll', 'too', 
        'seems', '’m', 'himself', 'latter', 'whither', 'amount', 'other', 'nevertheless', 'whom', 'for', 'somehow', 
        'beforehand', 'just', 'an', 'beyond', 'amongst', 'none', "'ve", 'say', 'via', 'but', 'often', 're', 'our', 
        'because', 'rather', 'using', 'without', 'throughout', 'on', 'she', 'never', 'eight', 'no', 'hereupon', 
        'them', 'whereafter', 'quite', 'which', 'move', 'thru', 'until', 'afterwards', 'fifty', 'i', 'itself', 'n‘t',
        'him', 'could', 'front', 'within', '‘re', 'back', 'such', 'already', 'several', 'side', 'whence', 'me', 
        'same', 'were', 'it', 'every', 'third', 'together'
    ]
}

## Function 1: Metric Dictionary

Write a function that calculates the mean, median, variance, standard deviation, minimum and maximum of a list of items. You can assume the given list contains only numerical entries, and you may use numpy functions to do this.

**Function Specifications:**
- Function should allow a list as input.
- It should return a `dict` with keys `'mean'`, `'median'`, `'std'`, `'var'`, `'min'`, and `'max'`, corresponding to the mean, median, standard deviation, variance, minimum and maximum of the input list, respectively.
- The standard deviation and variance values must be unbiased. **Hint:** use the `ddof` parameter in the corresponding numpy functions!
- All values in the returned `dict` should be rounded to 2 decimal places.
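
To illustrate the `ddof` hint, here is a minimal sketch on a made-up list (`sample` is a hypothetical name and is not part of the Eskom data). It compares numpy's default population estimate with the sample estimate that divides by `n - 1`:

```python
import numpy as np

# Hypothetical toy data, not taken from the Eskom dataset
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# ddof=0 (the numpy default) divides by n: the population variance
population_var = float(np.var(sample))

# ddof=1 divides by n - 1: the sample variance and sample standard deviation
sample_var = float(np.var(sample, ddof=1))
sample_std = float(np.std(sample, ddof=1))

print(population_var)        # 4.0
print(round(sample_var, 2))  # 4.57
print(round(sample_std, 2))  # 2.14
```

The gap between the two estimates shrinks as the list gets longer, which is why the choice matters most for short lists.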

In [21]:
### START FUNCTION
def dictionary_of_metrics(items):

    """ A function that calculates the mean, median, variance, standard deviation, minimum and maximum of a list of items rounded off to 2               decimal places """

    n = len(items)
    items = sorted(items)
    
    #mean
    # Calculate the mean of the list of items
    sum1 = sum(items)
    mean = round(sum1/n,2)
    
    #median 
    # Calculate the median of the list of items
    if n % 2 == 0: 
        med= items[n//2] 
        med2= items[n//2 - 1] 
        median = round((med + med2)/2, 2)
    else: 
        median = round(items[n//2], 2) 
    
    #variance 
    # Calculate the variance of the list of items
    variance = round(np.var(items, ddof=1), 2)
    
    #standard deviation
    # Calculate the standard deviation of the list of items
    std = round(np.std(items, ddof=1), 2)

    #maximum
    # Calculate the maximum of the list of items
    max1 =round(max(items), 2)

    #minimum
    # Calculate the minimum of the list of items
    min1 = round(min(items), 2)


    # Return a dictionary of metrics    
    return {'mean':mean,'median': median,'var': variance,'std':std, 'min':min1, 'max':max1}

### END FUNCTION

In [22]:
dictionary_of_metrics(gauteng)

{'mean': 26244.42,
 'median': 24403.5,
 'var': 108160153.17,
 'std': 10400.01,
 'min': 8842.0,
 'max': 39660.0}

_**Expected Output**_:

```python
dictionary_of_metrics(gauteng) == {'mean': 26244.42,
                                   'median': 24403.5,
                                   'var': 108160153.17,
                                   'std': 10400.01,
                                   'min': 8842.0,
                                   'max': 39660.0}
 ```

## Function 2: Five Number Summary

Write a function which takes in a list of integers and returns a dictionary of the [five number summary](https://www.statisticshowto.datasciencecentral.com/how-to-find-a-five-number-summary-in-statistics/).

**Function Specifications:**
- The function should take a list as input.
- The function should return a `dict` with keys `'max'`, `'median'`, `'min'`, `'q1'`, and `'q3'` corresponding to the maximum, median, minimum, first quartile and third quartile, respectively. You may use numpy functions to aid in your calculations.
- All numerical values should be rounded to two decimal places.
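
As a rough illustration of the quartile calculation, the sketch below runs `np.percentile` and `np.median` on a small made-up list (`values` is a hypothetical name). It assumes numpy's default linear interpolation, which can give slightly different quartiles from some textbook methods:

```python
import numpy as np

# Hypothetical toy data: nine evenly spaced values
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Quartiles and median; float() converts numpy scalars to plain Python floats
q1 = float(np.percentile(values, 25))
median = float(np.median(values))
q3 = float(np.percentile(values, 75))

summary = {'max': max(values), 'median': median, 'min': min(values), 'q1': q1, 'q3': q3}
print(summary)  # {'max': 9, 'median': 5.0, 'min': 1, 'q1': 3.0, 'q3': 7.0}
```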

In [23]:
### START FUNCTION
def five_num_summary(items):
    """Return the five number summary of a list of numbers, each value rounded to 2 decimal places."""

    # Minimum and maximum of the list
    minimum = round(min(items), 2)
    maximum = round(max(items), 2)

    # Median (the 50th percentile)
    median = round(float(np.median(items)), 2)

    # First and third quartiles using numpy.percentile
    q1 = round(float(np.percentile(items, 25)), 2)
    q3 = round(float(np.percentile(items, 75)), 2)

    dictionary = {'max': maximum, 'median': median, 'min': minimum, 'q1': q1, 'q3': q3}

    return dictionary

### END FUNCTION

In [24]:
five_num_summary(gauteng)

{'max': 39660.0,
 'median': 24403.5,
 'min': 8842.0,
 'q1': 18653.0,
 'q3': 36372.0}

_**Expected Output:**_

```python
five_num_summary(gauteng) == {
    'max': 39660.0,
    'median': 24403.5,
    'min': 8842.0,
    'q1': 18653.0,
    'q3': 36372.0
}

```

## Function 3: Date Parser

The `dates` variable (created at the top of this notebook) is a list of dates represented as strings. Each string contains the date in `'yyyy-mm-dd'` format, followed by the time in `hh:mm:ss` format. The first three entries in this variable are:
```python
dates[:3] == [
    '2019-11-29 12:50:54',
    '2019-11-29 12:46:53',
    '2019-11-29 12:46:10'
]
```

Write a function that takes as input a list of these datetime strings and returns only the date in `'yyyy-mm-dd'` format.

**Function Specifications:**
- The function should take a list of strings as input.
- Each string in the input list is formatted as `'yyyy-mm-dd hh:mm:ss'`.
- The function should return a list of strings where each element in the returned list contains only the date in the `'yyyy-mm-dd'` format.
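
For a single timestamp, the date part can be isolated in several ways. The sketch below shows three of them on one string taken from `dates[:3]`; the variable names are hypothetical and this is only one possible approach:

```python
from datetime import datetime

# One timestamp in the 'yyyy-mm-dd hh:mm:ss' format used by the dates list
stamp = '2019-11-29 12:50:54'

# Option 1: split on the space and keep the first piece
date_by_split = stamp.split(' ')[0]

# Option 2: the date always occupies the first 10 characters
date_by_slice = stamp[:10]

# Option 3: parse the string, which also validates it, then format the date back out
date_by_parse = datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S').strftime('%Y-%m-%d')

print(date_by_split, date_by_slice, date_by_parse)  # 2019-11-29 2019-11-29 2019-11-29
```

Splitting and slicing are the simplest options, while parsing raises a `ValueError` if a string does not match the expected format.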

In [25]:
### START FUNCTION
def date_parser(dates):

    """ A function that takes in a list of strings in datetime format (yyyy-mm-dd) and returns a list of strings containing only the date (yyyy-mm-dd) """

    # Create an empty list
    new_format = []

    # Iterate through the elements of the list dates
    for i in dates:
        # Split each element on the space separator
        x = i.split(' ')

        # Append the zero index of the split element to the empty list
        new_format.append(x[0])

    # Return the list
    return new_format

### END FUNCTION

In [26]:
date_parser(dates[:3])

_**Expected Output:**_

```python
date_parser(dates[:3]) == ['2019-11-29', '2019-11-29', '2019-11-29']
date_parser(dates[-3:]) == ['2019-11-20', '2019-11-20', '2019-11-20']
```

## Function 4: Municipality & Hashtag Detector

Write a function which takes in a pandas dataframe and returns a modified dataframe that includes two new columns that contain information about the municipality and hashtag of the tweet.

**Function Specifications:**
* Function should take a pandas `dataframe` as input.
* Extract the municipality from a tweet using the `mun_dict` dictionary defined at the top of this notebook, and insert the result into a new column named `'municipality'` in the same dataframe.
* Use the entry `np.nan` when a municipality is not found.
* Extract a list of hashtags from a tweet into a new column named `'hashtags'` in the same dataframe.
* Use the entry `np.nan` when no hashtags are found.

**Hint:** you will need the `mun_dict` variable defined at the top of this notebook.
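
The two extraction steps can be sketched on a single made-up tweet. The helper names `find_municipality` and `find_hashtags`, the `handles` subset, and the tweet text are all hypothetical; they only illustrate the matching logic and are not the required solution:

```python
import numpy as np

# Hypothetical two-entry subset of mun_dict, repeated here so the example is self-contained
handles = {'@CityPowerJhb': 'Johannesburg', '@CityTshwane': 'Tshwane'}

def find_municipality(tweet):
    """Return the municipality of the first known handle in the tweet, else NaN."""
    for handle, name in handles.items():
        if handle in tweet:
            return name
    return np.nan

def find_hashtags(tweet):
    """Return the lowercase hashtags in the tweet as a list, else NaN."""
    tags = [word.lower() for word in tweet.split() if word.startswith('#')]
    return tags if tags else np.nan

tweet = '@CityTshwane power is back in #Pretoria #Eskom'  # made-up tweet

print(find_municipality(tweet))  # Tshwane
print(find_hashtags(tweet))      # ['#pretoria', '#eskom']
```

Looping over the dictionary keeps the handles and names in one place, so adding a municipality only requires a new dictionary entry.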


In [27]:
### START FUNCTION
def extract_municipality_hashtags(df):
    # your code here

    #Convert dataframe input to dictionary
    twitter_df_dict = df.to_dict('list')

    #Create an empty variable: cities
    cities = []

    #Create an empty variable: hash_tags
    hash_tags = []

    # Loop over the tweets and record the first official municipality handle found in each one
    for x in twitter_df_dict['Tweets']:
        for handle, municipality in mun_dict.items():
            if handle in x:
                cities.append(municipality)
                break
        else:
            # No handle from mun_dict appears in this tweet
            cities.append(np.nan)
    # Loop over the tweets again, this time collecting hashtags
    for y in twitter_df_dict['Tweets']:
        # Keep the lowercase words that start with '#'; use NaN when the tweet contains no '#'
        if '#' in y:
            hash_tags.append([i.lower() for i in y.split() if i[0] == '#'])
        else:
            hash_tags.append(np.nan)
    
    # Create the two new columns: municipality and hashtags
    twitter_df_dict['municipality'] = cities
    twitter_df_dict['hashtags'] = hash_tags

    #Convert back to dataframe
    back_on_df = pd.DataFrame(twitter_df_dict)
    
    return back_on_df

### END FUNCTION

In [28]:
extract_municipality_hashtags(twitter_df.copy())

Unnamed: 0,Tweets,Date,municipality,hashtags
0,@BongaDlulane Please send an email to mediades...,2019-11-29 12:50:54,,
1,@saucy_mamiie Pls log a call on 0860037566,2019-11-29 12:46:53,,
2,@BongaDlulane Query escalated to media desk.,2019-11-29 12:46:10,,
3,"Before leaving the office this afternoon, head...",2019-11-29 12:33:36,,
4,#ESKOMFREESTATE #MEDIASTATEMENT : ESKOM SUSPEN...,2019-11-29 12:17:43,,"[#eskomfreestate, #mediastatement]"
...,...,...,...,...
195,Eskom's Visitors Centres’ facilities include i...,2019-11-20 10:29:07,,
196,#Eskom connected 400 houses and in the process...,2019-11-20 10:25:20,,"[#eskom, #eskom, #poweringyourworld]"
197,@ArthurGodbeer Is the power restored as yet?,2019-11-20 10:07:59,,
198,@MuthambiPaulina @SABCNewsOnline @IOL @eNCA @e...,2019-11-20 10:07:41,,


_**Expected Outputs:**_ 

```python

extract_municipality_hashtags(twitter_df.copy())

```
> <table class="dataframe" border="1">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Tweets</th>
      <th>Date</th>
      <th>municipality</th>
      <th>hashtags</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>@BongaDlulane Please send an email to mediades...</td>
      <td>2019-11-29 12:50:54</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>1</th>
      <td>@saucy_mamiie Pls log a call on 0860037566</td>
      <td>2019-11-29 12:46:53</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>2</th>
      <td>@BongaDlulane Query escalated to media desk.</td>
      <td>2019-11-29 12:46:10</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Before leaving the office this afternoon, head...</td>
      <td>2019-11-29 12:33:36</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>4</th>
      <td>#ESKOMFREESTATE #MEDIASTATEMENT : ESKOM SUSPEN...</td>
      <td>2019-11-29 12:17:43</td>
      <td>NaN</td>
      <td>[#eskomfreestate, #mediastatement]</td>
    </tr>
    <tr>
      <th>...</th>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <th>195</th>
      <td>Eskom's Visitors Centres’ facilities include i...</td>
      <td>2019-11-20 10:29:07</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>196</th>
      <td>#Eskom connected 400 houses and in the process...</td>
      <td>2019-11-20 10:25:20</td>
      <td>NaN</td>
      <td>[#eskom, #eskom, #poweringyourworld]</td>
    </tr>
    <tr>
      <th>197</th>
      <td>@ArthurGodbeer Is the power restored as yet?</td>
      <td>2019-11-20 10:07:59</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>198</th>
      <td>@MuthambiPaulina @SABCNewsOnline @IOL @eNCA @e...</td>
      <td>2019-11-20 10:07:41</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>199</th>
      <td>RT @GP_DHS: The @GautengProvince made a commit...</td>
      <td>2019-11-20 10:00:09</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
  </tbody>
</table>

## Function 5: Number of Tweets per Day

Write a function which calculates the number of tweets that were posted per day. 

**Function Specifications:**
- It should take a pandas dataframe as input.
- It should return a new dataframe, grouped by day, with the number of tweets for that day.
- The index of the new dataframe should be named `Date`, and the column of the new dataframe should be `'Tweets'`, corresponding to the date and number of tweets, respectively.
- The date should be formatted as `yyyy-mm-dd`, and should be a datetime object. **Hint:** look up `pd.to_datetime` to see how to do this.
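
The grouping step can be sketched on a three-row toy dataframe. The name `toy` and its contents are made up, and `dt.normalize()` is used here as one possible way to keep a datetime type while dropping the time of day:

```python
import pandas as pd

# Hypothetical toy data with the same column names as twitter_df
toy = pd.DataFrame({
    'Tweets': ['first tweet', 'second tweet', 'third tweet'],
    'Date': ['2019-11-20 10:00:09', '2019-11-20 10:07:41', '2019-11-21 08:15:00'],
})

# Parse the strings and reset every timestamp to midnight of its day
toy['Date'] = pd.to_datetime(toy['Date']).dt.normalize()

# Count the tweets per day; the double brackets keep the result a dataframe
counts = toy.groupby('Date')[['Tweets']].count()

print(counts['Tweets'].tolist())  # [2, 1]
```

The resulting index is named `Date` and the single column is `Tweets`, matching the layout described in the specifications.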

In [29]:
### START FUNCTION
def number_of_tweets_per_day(df):
    # Work on a copy so the caller's dataframe is left unchanged
    new_df = df.copy()

    # Reduce each timestamp in the Date column to its calendar date
    new_df['Date'] = pd.to_datetime(new_df['Date']).dt.date

    # Group by date and count the tweets posted on each day
    grouped_by_day = new_df.groupby('Date')[['Tweets']].count()

    return grouped_by_day

### END FUNCTION

In [30]:
number_of_tweets_per_day(twitter_df.copy())

Unnamed: 0_level_0,Tweets
Date,Unnamed: 1_level_1
2019-11-20,18
2019-11-21,11
2019-11-22,25
2019-11-23,19
2019-11-24,14
2019-11-25,20
2019-11-26,32
2019-11-27,13
2019-11-28,32
2019-11-29,16


_**Expected Output:**_

```python

number_of_tweets_per_day(twitter_df.copy())

```

> <table class="dataframe" border="1">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Tweets</th>
    </tr>
    <tr>
      <th>Date</th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>2019-11-20</th>
      <td>18</td>
    </tr>
    <tr>
      <th>2019-11-21</th>
      <td>11</td>
    </tr>
    <tr>
      <th>2019-11-22</th>
      <td>25</td>
    </tr>
    <tr>
      <th>2019-11-23</th>
      <td>19</td>
    </tr>
    <tr>
      <th>2019-11-24</th>
      <td>14</td>
    </tr>
    <tr>
      <th>2019-11-25</th>
      <td>20</td>
    </tr>
    <tr>
      <th>2019-11-26</th>
      <td>32</td>
    </tr>
    <tr>
      <th>2019-11-27</th>
      <td>13</td>
    </tr>
    <tr>
      <th>2019-11-28</th>
      <td>32</td>
    </tr>
    <tr>
      <th>2019-11-29</th>
      <td>16</td>
    </tr>
  </tbody>
</table>

## Function 6: Word Splitter

Write a function which splits the sentences in a dataframe's column into a list of the separate words. The created lists should be placed in a column named `'Split Tweets'` in the original dataframe. This is also known as [tokenization](https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/).

**Function Specifications:**
- It should take a pandas dataframe as an input.
- The dataframe should contain a column, named `'Tweets'`.
- The function should split the sentences in the `'Tweets'` column into a list of separate words, and place the result into a new column named `'Split Tweets'`. The resulting words must all be lowercase!
- The function should modify the input dataframe directly.
- The function should return the modified dataframe.
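
When tokenizing by splitting, the choice of separator matters. The short sketch below (with a made-up sentence containing a double space) contrasts `split(' ')` with the argument-free `split()`:

```python
# Hypothetical sentence with two spaces between the first two words
sentence = 'Power  is back'

# Splitting on a single space keeps an empty string where the spaces are doubled
tokens_single_space = sentence.lower().split(' ')

# Splitting with no argument collapses any run of whitespace
tokens_any_whitespace = sentence.lower().split()

print(tokens_single_space)    # ['power', '', 'is', 'back']
print(tokens_any_whitespace)  # ['power', 'is', 'back']
```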

In [31]:
### START FUNCTION
def word_splitter(df):
    """ A function which splits the sentences in a dataframe's column into a list of the        separate words.
    """
    # Create an empty list of tweets
    split_tweets = []
    # Iterate through the Tweets column on the dataframe
    for tweet in df['Tweets'].iteritems():
      # Split each tweet to lists of single words and append to split tweets
      split_tweets.append(tweet[1].split(' '))
    # Transform every letter on the tweets to lowercase
    split_tweets = [[word.lower() for word in tweet] for tweet in split_tweets]
    # Add a Split Tweets column to the dataframe
    df = df.assign(Split_Tweets = split_tweets).rename(columns={'Split_Tweets': 'Split Tweets'})
    # Return the dataframe
    return df
### END FUNCTION

In [32]:
word_splitter(twitter_df.copy())

Unnamed: 0,Tweets,Date,Split Tweets
0,@BongaDlulane Please send an email to mediades...,2019-11-29 12:50:54,"[@bongadlulane, please, send, an, email, to, m..."
1,@saucy_mamiie Pls log a call on 0860037566,2019-11-29 12:46:53,"[@saucy_mamiie, pls, log, a, call, on, 0860037..."
2,@BongaDlulane Query escalated to media desk.,2019-11-29 12:46:10,"[@bongadlulane, query, escalated, to, media, d..."
3,"Before leaving the office this afternoon, head...",2019-11-29 12:33:36,"[before, leaving, the, office, this, afternoon..."
4,#ESKOMFREESTATE #MEDIASTATEMENT : ESKOM SUSPEN...,2019-11-29 12:17:43,"[#eskomfreestate, #mediastatement, :, eskom, s..."
...,...,...,...
195,Eskom's Visitors Centres’ facilities include i...,2019-11-20 10:29:07,"[eskom's, visitors, centres’, facilities, incl..."
196,#Eskom connected 400 houses and in the process...,2019-11-20 10:25:20,"[#eskom, connected, 400, houses, and, in, the,..."
197,@ArthurGodbeer Is the power restored as yet?,2019-11-20 10:07:59,"[@arthurgodbeer, is, the, power, restored, as,..."
198,@MuthambiPaulina @SABCNewsOnline @IOL @eNCA @e...,2019-11-20 10:07:41,"[@muthambipaulina, @sabcnewsonline, @iol, @enc..."


_**Expected Output**_:

```python

word_splitter(twitter_df.copy()) 

```

> <table class="dataframe" border="1">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Tweets</th>
      <th>Date</th>
      <th>Split Tweets</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>@BongaDlulane Please send an email to mediades...</td>
      <td>2019-11-29 12:50:54</td>
      <td>[@bongadlulane, please, send, an, email, to, m...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>@saucy_mamiie Pls log a call on 0860037566</td>
      <td>2019-11-29 12:46:53</td>
      <td>[@saucy_mamiie, pls, log, a, call, on, 0860037...</td>
    </tr>
    <tr>
      <th>2</th>
      <td>@BongaDlulane Query escalated to media desk.</td>
      <td>2019-11-29 12:46:10</td>
      <td>[@bongadlulane, query, escalated, to, media, d...</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Before leaving the office this afternoon, head...</td>
      <td>2019-11-29 12:33:36</td>
      <td>[before, leaving, the, office, this, afternoon...</td>
    </tr>
    <tr>
      <th>4</th>
      <td>#ESKOMFREESTATE #MEDIASTATEMENT : ESKOM SUSPEN...</td>
      <td>2019-11-29 12:17:43</td>
      <td>[#eskomfreestate, #mediastatement, :, eskom, s...</td>
    </tr>
    <tr>
      <th>...</th>
      <td>...</td>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <th>195</th>
      <td>Eskom's Visitors Centres’ facilities include i...</td>
      <td>2019-11-20 10:29:07</td>
      <td>[eskom's, visitors, centres’, facilities, incl...</td>
    </tr>
    <tr>
      <th>196</th>
      <td>#Eskom connected 400 houses and in the process...</td>
      <td>2019-11-20 10:25:20</td>
      <td>[#eskom, connected, 400, houses, and, in, the,...</td>
    </tr>
    <tr>
      <th>197</th>
      <td>@ArthurGodbeer Is the power restored as yet?</td>
      <td>2019-11-20 10:07:59</td>
      <td>[@arthurgodbeer, is, the, power, restored, as,...</td>
    </tr>
    <tr>
      <th>198</th>
      <td>@MuthambiPaulina @SABCNewsOnline @IOL @eNCA @e...</td>
      <td>2019-11-20 10:07:41</td>
      <td>[@muthambipaulina, @sabcnewsonline, @iol, @enc...</td>
    </tr>
    <tr>
      <th>199</th>
      <td>RT @GP_DHS: The @GautengProvince made a commit...</td>
      <td>2019-11-20 10:00:09</td>
      <td>[rt, @gp_dhs:, the, @gautengprovince, made, a,...</td>
    </tr>
  </tbody>
</table>

## Function 7: Stop Words

Write a function which removes English stop words from a tweet.

**Function Specifications:**
- It should take a pandas dataframe as input.
- Should tokenise the sentences according to the definition in function 6. Note that function 6 **cannot be called within this function**.
- Should remove all stop words in the tokenised list. The stopwords are defined in the `stop_words_dict` variable defined at the top of this notebook.
- The resulting tokenised list should be placed in a column named `"Without Stop Words"`.
- The function should modify the input dataframe.
- The function should return the modified dataframe.
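
The filtering step can be sketched on one sentence. The three-word `stop_words` set below is a hypothetical stand-in for the full list in `stop_words_dict`, and the tweet text is adapted from the sample data:

```python
# Hypothetical tiny subset of the stop words, stored as a set for fast lookups
stop_words = {'is', 'the', 'as'}

tweet = 'Is the power restored as yet?'

# Lowercase, split on whitespace, and keep only the words that are not stop words
kept = [word for word in tweet.lower().split() if word not in stop_words]

print(kept)  # ['power', 'restored', 'yet?']
```

Note that `'yet?'` survives even though `'yet'` is in the full stop word list, because the punctuation stays attached to the word when splitting on whitespace.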


In [33]:
### START FUNCTION
def stop_words_remover(df):
    # Build a set of the stop words for fast membership tests
    stop_words = set(stop_words_dict['stopwords'])

    # Lowercase and split each tweet, keeping only the words that are not stop words,
    # and store the resulting lists in a new column named "Without Stop Words"
    df['Without Stop Words'] = df['Tweets'].apply(
        lambda tweet: [word for word in tweet.lower().split() if word not in stop_words]
    )
    return df

### END FUNCTION

In [34]:
stop_words_remover(twitter_df.copy())

Unnamed: 0,Tweets,Date,Without Stop Words
0,@BongaDlulane Please send an email to mediades...,2019-11-29 12:50:54,"[@bongadlulane, send, email, mediadesk@eskom.c..."
1,@saucy_mamiie Pls log a call on 0860037566,2019-11-29 12:46:53,"[@saucy_mamiie, pls, log, 0860037566]"
2,@BongaDlulane Query escalated to media desk.,2019-11-29 12:46:10,"[@bongadlulane, query, escalated, media, desk.]"
3,"Before leaving the office this afternoon, head...",2019-11-29 12:33:36,"[leaving, office, afternoon,, heading, weekend..."
4,#ESKOMFREESTATE #MEDIASTATEMENT : ESKOM SUSPEN...,2019-11-29 12:17:43,"[#eskomfreestate, #mediastatement, :, eskom, s..."
...,...,...,...
195,Eskom's Visitors Centres’ facilities include i...,2019-11-20 10:29:07,"[eskom's, visitors, centres’, facilities, incl..."
196,#Eskom connected 400 houses and in the process...,2019-11-20 10:25:20,"[#eskom, connected, 400, houses, process, conn..."
197,@ArthurGodbeer Is the power restored as yet?,2019-11-20 10:07:59,"[@arthurgodbeer, power, restored, yet?]"
198,@MuthambiPaulina @SABCNewsOnline @IOL @eNCA @e...,2019-11-20 10:07:41,"[@muthambipaulina, @sabcnewsonline, @iol, @enc..."


_**Expected Output**_:

Specific rows:

```python
stop_words_remover(twitter_df.copy()).loc[0, "Without Stop Words"] == ['@bongadlulane', 'send', 'email', 'mediadesk@eskom.co.za']
stop_words_remover(twitter_df.copy()).loc[100, "Without Stop Words"] == ['#eskomnorthwest', '#mediastatement', ':', 'notice', 'supply', 'interruption', 'lichtenburg', 'area', 'https://t.co/7hfwvxllit']
```

Whole table:
```python
stop_words_remover(twitter_df.copy())
```

> <table class="dataframe" border="1">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Tweets</th>
      <th>Date</th>
      <th>Without Stop Words</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>@BongaDlulane Please send an email to mediades...</td>
      <td>2019-11-29 12:50:54</td>
      <td>[@bongadlulane, send, email, mediadesk@eskom.c...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>@saucy_mamiie Pls log a call on 0860037566</td>
      <td>2019-11-29 12:46:53</td>
      <td>[@saucy_mamiie, pls, log, 0860037566]</td>
    </tr>
    <tr>
      <th>2</th>
      <td>@BongaDlulane Query escalated to media desk.</td>
      <td>2019-11-29 12:46:10</td>
      <td>[@bongadlulane, query, escalated, media, desk.]</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Before leaving the office this afternoon, head...</td>
      <td>2019-11-29 12:33:36</td>
      <td>[leaving, office, afternoon,, heading, weekend...</td>
    </tr>
    <tr>
      <th>4</th>
      <td>#ESKOMFREESTATE #MEDIASTATEMENT : ESKOM SUSPEN...</td>
      <td>2019-11-29 12:17:43</td>
      <td>[#eskomfreestate, #mediastatement, :, eskom, s...</td>
    </tr>
    <tr>
      <th>...</th>
      <td>...</td>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <th>195</th>
      <td>Eskom's Visitors Centres’ facilities include i...</td>
      <td>2019-11-20 10:29:07</td>
      <td>[eskom's, visitors, centres’, facilities, incl...</td>
    </tr>
    <tr>
      <th>196</th>
      <td>#Eskom connected 400 houses and in the process...</td>
      <td>2019-11-20 10:25:20</td>
      <td>[#eskom, connected, 400, houses, process, conn...</td>
    </tr>
    <tr>
      <th>197</th>
      <td>@ArthurGodbeer Is the power restored as yet?</td>
      <td>2019-11-20 10:07:59</td>
      <td>[@arthurgodbeer, power, restored, yet?]</td>
    </tr>
    <tr>
      <th>198</th>
      <td>@MuthambiPaulina @SABCNewsOnline @IOL @eNCA @e...</td>
      <td>2019-11-20 10:07:41</td>
      <td>[@muthambipaulina, @sabcnewsonline, @iol, @enc...</td>
    </tr>
    <tr>
      <th>199</th>
      <td>RT @GP_DHS: The @GautengProvince made a commit...</td>
      <td>2019-11-20 10:00:09</td>
      <td>[rt, @gp_dhs:, @gautengprovince, commitment, e...</td>
    </tr>
  </tbody>
</table>