# Remote work sentiment analysis 
# Preprocessing


# Contents

* [0. Resources](#0_Resources)
* [1. Setup](#1_Setup)
    * [1.1 Import](#1.1_Import)
    * [1.2 Load Data](#1.2_Scrape_Data)
* [2. Shape Data](#2_Shape_Data)
    * [2.1 Remove non english tweets](#2.1_Remove_Non_English)
    * [2.2 Remove hiring and ads tweets](#2.2_Remove_Hiring)
    * [2.3 Remove username](#2.3_Remove_Username)
* [3. Remove unwanted words and characters](#3_Remove_Unwanted)
    * [3.1 Remove Http](#3.1_Remove_Http)
    * [3.2 Remove Emoji](#3.2_Remove_Emoji)
    * [3.3 Remove Mention](#3.3_Remove_Mention)
    * [3.4 Remove Special Characters and Numbers](#3.4_Remove_Specials)
* [4. Lemmatizing](#4_Lemmatizing)
* [5. Remove Stopwords](#5_Remove_Stopwords)
* [6. Remove Duplicates](#6_Remove_Duplicates)
* [7. Save_Data](#7_Save_Data)
* [8. Pipeline Functions](#8_Pipeline)
    * [8.1 Remove Ending Hashtags](#8.1_Remove_Ending_Hashtags)
    * [8.2 Split Hashtags Words](#8.2_Split_Hashtags_Words)
    * [8.3 Remove Less Than 4](#8.3_Remove_Less_Than_4)


# 0-Resources<a id='0_Resources'></a>

github <br>
https://github.com/mdipietro09/DataScience_ArtificialIntelligence_Utils/blob/master/natural_language_processing/example_text_classification.ipynb

article 1<br>
text analysis and feature engineering <br>
https://towardsdatascience.com/text-analysis-feature-engineering-with-nlp-502d6ea9225d

article 2<br>
Text classification tdidf word2vec bert <br>
https://towardsdatascience.com/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert-41ff868d1794

article 3<br>
BERT with no model training<br>
https://towardsdatascience.com/text-classification-with-no-model-training-935fe0e42180


**Topic Modelling**<br>
LDA on trump tweets<br>
https://medium.datadriveninvestor.com/trump-tweets-topic-modeling-using-latent-dirichlet-allocation-e4f93b90b6fe

Bert keyword<br>
https://towardsdatascience.com/keyword-extraction-with-bert-724efca412ea

Bert topic<br>
https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6

# 1-Setup<a id='1_Setup'></a>

## 1.1 Import<a id='1.1_Setup'></a>

In [1]:
## for scraping
import twint

# Solve compatibility issues with notebook and RunTime errors
import nest_asyncio
nest_asyncio.apply()

## for data
import json
import pandas as pd
import numpy as np
from sklearn import metrics, manifold

## for pre-processing
import re
import nltk
# nltk.download('wordnet')

## for plotting
import matplotlib.pyplot as plt
import seaborn as sns

## for language detection
import langdetect
import spacy
from spacy_langdetect import LanguageDetector


## 1.2 Scrape Data<a id='1.2_Scrape_Data'></a>

In [2]:
# Configure
config = twint.Config()

config.Search = "remote work"
config.Lang = "en"
config.Since = "2020-08-01"
config.Until = "2021-05-13"
config.Limit = 10000
config.Pandas = True
config.Filter_retweets = True


# Run
twint.run.Search(config)

1392630666471297029 2021-05-13 06:59:59 +0700 <RecruiterDotCom> Recruiter .com's April 2021 recruiter index® has found that the demand for in-person jobs is outpacing that of remote work.   Hit the link below for the full Recruiter Index®.   https://t.co/vh2wufdTpO #recruiterindex #recruiters #recruitment #jobmarket #labormarket #hiringtrends
1392630316494295042 2021-05-13 06:58:35 +0700 <rucsb> @TimSackett @lruettimann @FrankZupan @Lars +1. Real Estate Market would crash if there is no demand for commercial space.   Hybrid work / Remote work works . If we design for it. For decades, Office space worked as space to socialize with fellow human beings.
1392630178359042054 2021-05-13 06:58:02 +0700 <ArneEkstrom1> Congratulations to Dr. Michael Starrett on successfully defending his dissertation!  Mike worked on everything from immersive VR with a treadmill to remote VR testing!  Great work Mike and looking forward to seeing what you do next with Dr. Liz Chrastil at UC Irvine!
139262949886

In [3]:
# Get column names
columns_names = twint.output.panda.Tweets_df.columns
columns_names

Index(['id', 'conversation_id', 'created_at', 'date', 'timezone', 'place',
       'tweet', 'language', 'hashtags', 'cashtags', 'user_id', 'user_id_str',
       'username', 'name', 'day', 'hour', 'link', 'urls', 'photos', 'video',
       'thumbnail', 'retweet', 'nlikes', 'nreplies', 'nretweets', 'quote_url',
       'search', 'near', 'geo', 'source', 'user_rt_id', 'user_rt',
       'retweet_id', 'reply_to', 'retweet_date', 'translate', 'trans_src',
       'trans_dest'],
      dtype='object')

In [4]:
# Save as data frame 
# Lets only keep date, tweet, and username
df = twint.output.panda.Tweets_df[['date','tweet', 'username']]


In [5]:
# Save as CSV
df.to_csv('tweets.csv',index =False)


In [6]:
# Reload Data
df = pd.read_csv('tweets.csv')


# 2 - Shape Data<a id='2_Shape_Data'></a>

<span style="background-color:Teal">"Many entries will not be useful for our sentiment analysis lets go through the tweets and see what we can remove<span>

In [7]:
df.head(20).tweet

0     Recruiter .com's April 2021 recruiter index® h...
1     @TimSackett @lruettimann @FrankZupan @Lars +1....
2     Congratulations to Dr. Michael Starrett on suc...
3     @AlvarezGibson Concur. My company was 100% "yo...
4     @mattreign Why not ask if we really need that ...
5     Dear Line Managers, Appraisal your subordinate...
6     ¿Quiere viajar y trabajar? Te contamos todo lo...
7     @GlobeIdeas @JonLevyTLB I've had more opportun...
8     Study reveals growing cybersecurity risks driv...
9     @joeywreck I'm lucky, mine is moving to a hybr...
10    Shifting to a #remotework environment created ...
11    Now hiring: devops engineers. I need 4 people ...
12    Hi @usepixie are Hiring Remote!  (Anywhere 100...
13    @heybereket  https://t.co/ynwfexrjuI - profess...
14    Last day of my temp job today - going to try t...
15    @JoelGratcyk I feel your pain.   I was looking...
16    Learn the top business functions to outsource ...
17    We’re #hiring a P/T In House #Editor! Come

<span style="background-color:Teal">We see on line 19 and 6 that some sentences are not in english and http links<span>

## 2.1 Remove non english tweets<a id = '2.1_Remove_Non_English'></a>

### 2.1.1 Remove rows that langdetect cannot detect ( links and language with special characters)

In [8]:
num_of_rows = len(df)
indices_with_error = []
for i in range(num_of_rows):
    # get text
    text = df.tweet.iloc[i]
    try:
        language = langdetect.detect(text)
    except:
        language = "error"
        print("This row throws and error:", text, 'at index', i)
        # get indices that will throw error
        indices_with_error.append(i)
    
indices_with_error

This row throws and error: Trend ຂອງໂລກຈະໄປ remote work ຖ້າເຮົາເກາະກະແສນີ້ ຈະສາມາດສ້າງປະໂຫຍດໃຫ້ປະເທດຊາດຫລາຍ at index 3521


[3521]

In [9]:
# remove rows
df = df.drop(indices_with_error,axis = 0)
df = df.reset_index(drop = True)
df

Unnamed: 0,date,tweet,username
0,2021-05-13 06:59:59,Recruiter .com's April 2021 recruiter index® h...,RecruiterDotCom
1,2021-05-13 06:58:35,@TimSackett @lruettimann @FrankZupan @Lars +1....,rucsb
2,2021-05-13 06:58:02,Congratulations to Dr. Michael Starrett on suc...,ArneEkstrom1
3,2021-05-13 06:55:20,"@AlvarezGibson Concur. My company was 100% ""yo...",thirdnline
4,2021-05-13 06:55:01,@mattreign Why not ask if we really need that ...,toofarnorth49
...,...,...,...
10004,2021-05-10 00:30:32,A majority of people want full-time remote wor...,chris_herd
10005,2021-05-10 00:30:20,Tech makes more gig work possible while covid ...,AdventuresOTM
10006,2021-05-10 00:29:42,Catherine Merrill should lose her job. The abs...,SukolVentures
10007,2021-05-10 00:29:26,Before we even get to prediction: the OBSERVAT...,MASHAgindler


In [10]:
# check if it will stil throw error
num_of_rows = len(df)
indices_with_error = []
for i in range(num_of_rows):
    # get text
    text = df.tweet.iloc[i]
    try:
        language = langdetect.detect(text)
    except:
        language = "error"
        print("This row throws and error:", text, 'at index', i)
        # get indices that will throw error
        indices_with_error.append(i)
  

In [11]:
indices_with_error  

[]

<span style="background-color:Teal">We've removed all rows that has no language. Now lets get all the tweet language. and remove the tweets that are not in english<span>

### 2.1.2 Remove tweets that are not english

In [12]:
# 1. get the list of language
language_list = []
for i in range(num_of_rows):
    # get text
    text = df.tweet.iloc[i]
    language = langdetect.detect(text)
    # get language
    language_list.append(language)
    

In [13]:
# 2. append the list to our dataframe
df["Language"] = language_list
df.iloc[19]

date                                      2021-05-13 06:43:24
tweet       belajar belajar untuk melakukan remote code ex...
username                                            bpptwiter
Language                                                   id
Name: 19, dtype: object

In [14]:
# 3. remove rows that are not in english
df = df[df["Language"] == 'en']

In [15]:
df.Language.value_counts()

en    9787
Name: Language, dtype: int64

All tweets are now in english

In [16]:
# reset index
df = df.reset_index(drop = True)
df.head(10)

Unnamed: 0,date,tweet,username,Language
0,2021-05-13 06:59:59,Recruiter .com's April 2021 recruiter index® h...,RecruiterDotCom,en
1,2021-05-13 06:58:35,@TimSackett @lruettimann @FrankZupan @Lars +1....,rucsb,en
2,2021-05-13 06:58:02,Congratulations to Dr. Michael Starrett on suc...,ArneEkstrom1,en
3,2021-05-13 06:55:20,"@AlvarezGibson Concur. My company was 100% ""yo...",thirdnline,en
4,2021-05-13 06:55:01,@mattreign Why not ask if we really need that ...,toofarnorth49,en
5,2021-05-13 06:54:11,"Dear Line Managers, Appraisal your subordinate...",uzomabenny,en
6,2021-05-13 06:52:32,@GlobeIdeas @JonLevyTLB I've had more opportun...,juliethrelkeld,en
7,2021-05-13 06:52:22,Study reveals growing cybersecurity risks driv...,docangelmtz1,en
8,2021-05-13 06:51:45,"@joeywreck I'm lucky, mine is moving to a hybr...",Diablerie617,en
9,2021-05-13 06:51:00,Shifting to a #remotework environment created ...,eclypsium,en


<span style="background-color:Teal">Also notice that all the tweets with http links are mostly job openings and links to articles without sentiments. Lets see the tweets with http links and see if we need them<span>

### 2.1.3 Check tweets that have http links

In [17]:
def match_regex(regex, the_df, column_name):
    '''function returns the index in the dataframe that matches the given regex'''
    indices_that_match = []
    end_index = len(df)
    for i in range(end_index):
        matched_words = []
        if (column_name == 'tweet'):
            matched_words = re.findall(regex, df.iloc[i].tweet)
        else:
            matched_words = re.findall(regex, df.iloc[i].username)
            
        # if this row is to be removed(because there is a match)    
        if len(matched_words) != 0:
            indices_that_match.append(i)
    return indices_that_match


In [18]:

def print_regex(indices_that_match, the_df, column_name):
    '''Function that print all the tweets in indices'''
    if (len(the_df)>=20) and (len(indices_that_match) >=20):
        for i in range(20):
            index = indices_that_match[i] 
            if column_name == 'tweet':
                print(the_df.iloc[index].tweet,'\n')
            else:
                print(the_df.iloc[index].username,'\n')
    else:
        print('less than 20 entries that in the index')

In [19]:
# prepare regex
regex_url = r'(https?://[^\s]+)'
indices_with_url = []

indices_with_url = match_regex(regex_url, df, 'tweet')


<span style="background-color:Teal">It seems like there are many indices with url lets check the proportion:<span>

In [20]:
len(indices_with_url)/len(df)

0.6684377235107796

<span style="background-color:Teal"><span style="background-color:Teal">More than half have url. Should we get rid of all of them? Lets check the tweet<span>

In [21]:
print_regex(indices_with_url,df,'tweet')

Recruiter .com's April 2021 recruiter index® has found that the demand for in-person jobs is outpacing that of remote work.   Hit the link below for the full Recruiter Index®.   https://t.co/vh2wufdTpO #recruiterindex #recruiters #recruitment #jobmarket #labormarket #hiringtrends 

Study reveals growing cybersecurity risks driven by remote work  #cyber #CyberSecurity #cybercrime #CyberAttack #cyberdefense  https://t.co/8QWdBG1mFU 

Shifting to a #remotework environment created challenges for many businesses &amp; government institutions. New tools allow you to gain visibility into #firmware vulnerabilities, #hardware misconfigurations, and other #cyberthreats. 👉  https://t.co/vFirFiQGl3 #firmsec #CyberSecurity  https://t.co/fhon6WXEza 

Hi @usepixie are Hiring Remote!  (Anywhere 100% Remote)  👉 Product Designer  Apply now 👇 #remotejob #hiring #design   https://t.co/7mdXI0v82K 

@heybereket  https://t.co/ynwfexrjuI - professionals from a range of industries who now specialise in SMM/SEO

<span style="background-color:Teal"> We cannot remove all of the tweets with links because some of them mention the struggle of 'isolation': <br>
    
<span style="background-color:Teal"> *Another day of isolation done and dusted. I ticked everything off my list again. Today's subject CPD was provided by the Remote CPD section on @LitdriveUK and then this afternoon's session was provided by the Director of Teaching and Learning at work* <br>
    
<span style="background-color:Teal"> But lets remove tweets with the hashtag #remotejobs #RemoteJobs #Hiring #hiring #HIRINGNOW
    </span>



### 2.1.4 Save df after filtering english only

In [22]:
df.to_csv('tweets.csv',index =False)


In [23]:
# restart from here
df = pd.read_csv('tweets.csv')
df

Unnamed: 0,date,tweet,username,Language
0,2021-05-13 06:59:59,Recruiter .com's April 2021 recruiter index® h...,RecruiterDotCom,en
1,2021-05-13 06:58:35,@TimSackett @lruettimann @FrankZupan @Lars +1....,rucsb,en
2,2021-05-13 06:58:02,Congratulations to Dr. Michael Starrett on suc...,ArneEkstrom1,en
3,2021-05-13 06:55:20,"@AlvarezGibson Concur. My company was 100% ""yo...",thirdnline,en
4,2021-05-13 06:55:01,@mattreign Why not ask if we really need that ...,toofarnorth49,en
...,...,...,...,...
9782,2021-05-10 00:30:32,A majority of people want full-time remote wor...,chris_herd,en
9783,2021-05-10 00:30:20,Tech makes more gig work possible while covid ...,AdventuresOTM,en
9784,2021-05-10 00:29:42,Catherine Merrill should lose her job. The abs...,SukolVentures,en
9785,2021-05-10 00:29:26,Before we even get to prediction: the OBSERVAT...,MASHAgindler,en


## 2.2 Remove advert tweets <a id = '2.2_Remove_Hiring'></a>

### 2.2.1 Match by hashtags

In [24]:
def print_tweet(the_df, end_index):
    for i in range(0,end_index):
        print(i)
        print(the_df.iloc[i].tweet,'\n')
        

In [25]:
# Match by 
# (?i) makes it match case insensitive and
text = 'But lets remove tweets with the hashtag #remotejobs #RemoteJobs #Hiring #hiring #HIRINGNOW #remote'
print(re.findall('(?i)\#(hiring|remotejob)', text))


['remotejob', 'RemoteJob', 'Hiring', 'hiring', 'HIRING']


In [26]:
regex_hiring_hashtag = '(?i)\#(hiring|remotejob|job)'
indices_with_hiring_hashtag = []
indices_with_hiring_hashtag = match_regex(regex_hiring_hashtag, df, 'tweet')


<span style="background-color:Teal"> Lets check these tweets
    </span>

In [27]:
print_regex(indices_with_hiring_hashtag,df,'tweet')
    

Recruiter .com's April 2021 recruiter index® has found that the demand for in-person jobs is outpacing that of remote work.   Hit the link below for the full Recruiter Index®.   https://t.co/vh2wufdTpO #recruiterindex #recruiters #recruitment #jobmarket #labormarket #hiringtrends 

Hi @usepixie are Hiring Remote!  (Anywhere 100% Remote)  👉 Product Designer  Apply now 👇 #remotejob #hiring #design   https://t.co/7mdXI0v82K 

We’re #hiring a P/T In House #Editor! Come work with us and dream forward positive futures! #job #freelance #remotework @WritersofColor    https://t.co/m2JTEpMIxo 

😲Green Man Gaming are on the lookout for an EVP Performance Marketing  Fully remote! Based in 🇬🇧 UK   https://t.co/aNRXbQt8T6  #remote #job #remotework 

What pre-pandemic job trends suggest about the post-pandemic future of the capital region @BrookingsInst -  https://t.co/HWdE6fcweq #remotework #remoteworking #remotejobs 

⭐New Remote Job on Incluzion -👩🏾‍💻- Partner Success Manager⭐: 🏢Company Name: Medi

<span style="background-color:Teal"> We have succesfully extract all the hiring ads.Lets remove this from our dataframe
    </span>

In [28]:
# drop this hiring tweets
df = df.drop(indices_with_hiring_hashtag)


In [29]:
df = df.reset_index(drop = True)
df.shape

(8478, 4)

<span style="background-color:Teal">lets check if hiring hashtag still exists</span>

In [30]:
indices_with_hiring_hashtag = match_regex(regex_hiring_hashtag, df, 'tweet')
indices_with_hiring_hashtag

[]

### 2.2.2 Match by keyword

<span style="background-color:Teal">Lets remove other hiring and ads keywords such as <br>
hiring|new remote job|open for|looking for|seeking|Job Vacancy|click here|subscribe|check it out|click|tips|check out|applications|on the lookout for|work with us|available now|Find out more
<span>
    
We also need to remove Tweets that are capture remote work keyword by user name such as: <br>
remoteworkrebel @simonpaix Very cool"<remotework>
    

In [31]:
regex_hiring_keyword = '(?i)(hiring|Microsoft Teams|listen here|new remote job|open for|looking for|seeking|Job Vacancy|subscribe|check it out|click|tips|check out|applications|on the lookout for|work with us|available now)'
indices_hiring_keyword = match_regex(regex_hiring_keyword, df, 'tweet')


In [32]:
print_regex(indices_hiring_keyword,df, 'tweet')

Congratulations to Dr. Michael Starrett on successfully defending his dissertation!  Mike worked on everything from immersive VR with a treadmill to remote VR testing!  Great work Mike and looking forward to seeing what you do next with Dr. Liz Chrastil at UC Irvine! 

Now hiring: devops engineers. I need 4 people for some big projects. Would prefer senior level but looking for specific skills in aws/azure so that can be flexible. Remote work if you want. (most of us are now anyway). #DevOps #NowHiring #tech #azure #AWS 

@JoelGratcyk I feel your pain.   I was looking for a new position recently. I had one company change their minds about remote work after three interviews and a formal offer. They decided at the last minute that they  wanting two days a week onsite. They were in Atlanta.  Me? Baltimore.  https://t.co/MmMSzNpUNI 

It’s time for transition: How Microsoft Teams is better than Skype for Business? Learn more @  https://t.co/59Q4IAOtyD #meetings #microsoftteams #communicatio

In [33]:
df_hiring_removed = df.drop(indices_hiring_keyword)
df_hiring_removed = df_hiring_removed.reset_index(drop = True)


<span style="background-color:Teal"> We've succesfully removed many ads from over 9000 rows to 7000~ rows <span>

In [34]:
df = df_hiring_removed
print(df.shape)

(7225, 4)


In [35]:
df.to_csv('tweets.csv',index = False)


In [36]:
# restart from here
df = pd.read_csv('tweets.csv')

## 2.3 Remove username with remote <a id = '2.3_Remove_Username'></a>

In [37]:
regex_username = '(?i)(remote)'
indices_remote_username = match_regex(regex_username, df, 'username')
print_regex(indices_remote_username, df, 'username')

LeadingRemotely 

remote_wander 

RemoteWorkNews 

RemoteWorkNews 

RemoteWorkNews 

ChiefRemote 

RemotelyWire 

ThinkRemote 

RemoteWorkNews 

RemoteWorkNews 

RemoteWorkNews 

RemoteWorkNews 

ThinkRemote 

GoRemote1 

GoRemote1 

remotedailylive 

GoRemote1 

GoRemote1 

VirtiraRemote 

remoteworkrebel 



In [38]:
df = df.drop(indices_remote_username, axis = 0)
df = df.reset_index(drop = True)

In [39]:
print(match_regex(regex_username, df, 'username'))

[]


# 3-Remove unwanted characters and words<a id='3_Remove_Unwanted'></a>

<span style="background-color:Teal"> In this section we will remove unwanted part of our tweet that are not going to be useful for our analysis: <br>
<span style="background-color:Teal">
- url link<br>
- emoji<br>
- mention @ some user<br>
- Hash tags only tweets. example:<br>
#NGOs #NPOs #DigitalNGOs #Tech4Good #NpTech #Productivity #RemoteWork @TechSoup
- Questions only tweets. example:<br>
 Is your desk crammed with papers? Are you sitting properly? What about stretching and resting your eyes?  We wrote a blog post on how to make your work environment work for you and listed some good habits to take on when working remotely<br>
<span style="background-color:Teal">
    


-Once we have clearn our text we will finally remove duplicate tweets: <br>
'How HPE, Verizon and Mars Wrigley Manage Employees During Remote Work'  https://t.co/FD7Us5BiUY <br>
'How HPE, Verizon and Mars Wrigley Manage Employees During Remote Work'  https://t.co/phXphFFwxH <br>

## 3.1 Remove Http link<a id = '3.1_Remove_Http'></a>

In [40]:
# 1. our regex
regex_new_url = r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»""'']))'


In [41]:

def remove_regex(the_regex, the_df):
    '''Remove the regex string from our tweets'''
    the_df = the_df.replace(to_replace=the_regex, value='', regex=True)
    return the_df

In [42]:
df = remove_regex(regex_new_url, df)


## 3.2 Remove emoji<a id = '3.2_Remove_Emoji'></a>


<span style="background-color:Teal">We shouldn't simply remove emojis in sentiment analysis as smileys can give huge cue in the sentiment

In [43]:
import emoji
text = "🇺🇸 #remotework 📚☕ #Iowa 👩‍👧 #homestate 👨‍👦 😎  #remotework #RemoteChat"
emoji.demojize(text)

':United_States: #remotework :books::hot_beverage: #Iowa :family_woman_girl: #homestate :family_man_boy: :smiling_face_with_sunglasses:  #remotework #RemoteChat'

In [44]:
# 1. create a list of demojize text
demojize_tweets = []

# 2. append to dataframe
for i in range(len(df)):
    text = df.iloc[i].tweet
    demojize_tweet = emoji.demojize(text)
    demojize_tweets.append(demojize_tweet)
    
# 3. check our list
len(demojize_tweets)

6919

In [45]:
print(demojize_tweets[175])
print(demojize_tweets[182])

Survey: Working Parents Will Quit Without Remote Work    
Over the past year, many transitioned successfully to #remotework. Now as organizations are reopening, we identified five key steps to give workers the flexibility to work from home, at the office, and everywhere in between. #ReturnToOffice  


In [46]:
demojize_tweets_series = pd.DataFrame(demojize_tweets)
df['tweet'] = demojize_tweets_series


## 3.3 Remove Mention<a id = '3.3_Remove_Mention'></a>

In [47]:
text = '@TimSackett @lruettimann @FrankZupan @_Lars +1. Real Estate Market would crash if there is no demand for commercial space.   Hybrid work / Remote work works .'
regex_mention = '(@[_A-Za-z0-9]+)'

re.sub(regex_mention, '',text)

'    +1. Real Estate Market would crash if there is no demand for commercial space.   Hybrid work / Remote work works .'

In [48]:
df = remove_regex(regex_mention, df)


## 3.4 Remove 'Read, find out, learn more..'

<span style="background-color:Teal"> Remove all characters after the phrase **read more, learn more, find our more, read our** . Since these are usally just pointing out to resources

In [49]:
regex_read_our = '(?s)(?i)(read our)(?<=read our)(.*$)'
text = 'Should you invest in Employee Engagement tools for remote work? Read our review of the top tools.    #remotework #wfh  #hr #workplaceculture #business #employees  #employeerecognition #employeeretention #management #employeeengagement #workplace '

re.findall(regex_read_our,text)

[('Read our',
  ' review of the top tools.    #remotework #wfh  #hr #workplaceculture #business #employees  #employeerecognition #employeeretention #management #employeeengagement #workplace ')]

In [50]:
regex_learn_more = '(?s)(?i)(learn more)(?<=learn more)(.*$)'
regex_read_more = '(?s)(?i)(read more)(?<=read more)(.*$)'
regex_find_out_more = '(?s)(?i)(find out more)(?<=find out more)(.*$)'
text2 = 'While your office may never become a fully #remote office, remote work for law firms can offer great benefits—as long as you are aware of and can mitigate the risks. Learn more in our recent article.     '
re.findall(regex_learn_more,text2)

[('Learn more', ' in our recent article.     ')]

In [51]:
df = remove_regex(regex_find_out_more, df)
df = remove_regex(regex_learn_more, df)
df = remove_regex(regex_read_more, df)


In [52]:
def print_empty_tweet(the_df):
    print(df[df.tweet == ""])
    
print_empty_tweet(df)

                     date tweet      username Language
1323  2021-05-12 22:09:04        ActorsEquity       en
4606  2021-05-11 05:42:01        ActorsEquity       en


In [53]:
# df = df[df.tweet != ""]
# print_empty_tweet(df)

def delete_empty_tweet(the_df):
    the_df = the_df[the_df.tweet != ""]
    
delete_empty_tweet(df)
print_empty_tweet(df)


                     date tweet      username Language
1323  2021-05-12 22:09:04        ActorsEquity       en
4606  2021-05-11 05:42:01        ActorsEquity       en


In [54]:
df.to_csv('tweets.csv',index =False)

In [55]:
# restart from here
df = pd.read_csv('tweets.csv')


In [56]:
print_empty_tweet(df)

Empty DataFrame
Columns: [date, tweet, username, Language]
Index: []


## 3.5 Remove Special Characters, Numbers, Punctuation<a id = '3.4_Remove_Specials'></a>

In [57]:
import string

# function to remove special characters and numbers
def remove_special_characters(the_text):
    # define the pattern to keep
    pat = r'[^a-zA-z.,#!?\"\'\s\’]' 
    return re.sub(pat, ' ', the_text)

def remove_extra_whitespace_tabs(text):
    pattern = '(^\s*|\s\s*)'
    return re.sub(pattern, ' ', text).strip()

In [58]:
df

Unnamed: 0,date,tweet,username,Language
0,2021-05-13 06:58:35,+1. Real Estate Market would crash if ther...,rucsb,en
1,2021-05-13 06:55:20,"Concur. My company was 100% ""you MUST work in...",thirdnline,en
2,2021-05-13 06:55:01,Why not ask if we really need that thing? I t...,toofarnorth49,en
3,2021-05-13 06:54:11,"Dear Line Managers, Appraisal your subordinate...",uzomabenny,en
4,2021-05-13 06:52:32,I've had more opportunities to work cross-fu...,juliethrelkeld,en
...,...,...,...,...
6914,2021-05-10 00:30:32,A majority of people want full-time remote wor...,chris_herd,en
6915,2021-05-10 00:30:20,Tech makes more gig work possible while covid ...,AdventuresOTM,en
6916,2021-05-10 00:29:42,Catherine Merrill should lose her job. The abs...,SukolVentures,en
6917,2021-05-10 00:29:26,Before we even get to prediction: the OBSERVAT...,MASHAgindler,en


In [59]:
df.isnull().value_counts()
df.dropna(inplace = True)
df.reset_index(drop = True)

Unnamed: 0,date,tweet,username,Language
0,2021-05-13 06:58:35,+1. Real Estate Market would crash if ther...,rucsb,en
1,2021-05-13 06:55:20,"Concur. My company was 100% ""you MUST work in...",thirdnline,en
2,2021-05-13 06:55:01,Why not ask if we really need that thing? I t...,toofarnorth49,en
3,2021-05-13 06:54:11,"Dear Line Managers, Appraisal your subordinate...",uzomabenny,en
4,2021-05-13 06:52:32,I've had more opportunities to work cross-fu...,juliethrelkeld,en
...,...,...,...,...
6912,2021-05-10 00:30:32,A majority of people want full-time remote wor...,chris_herd,en
6913,2021-05-10 00:30:20,Tech makes more gig work possible while covid ...,AdventuresOTM,en
6914,2021-05-10 00:29:42,Catherine Merrill should lose her job. The abs...,SukolVentures,en
6915,2021-05-10 00:29:26,Before we even get to prediction: the OBSERVAT...,MASHAgindler,en


In [60]:
new_text_list = []
for i in range(len(df)):   
    old_text = df.iloc[i].tweet
    new_text = df.iloc[i].tweet.replace("’", "'")
    new_text = new_text.replace("-", " ")
    new_text = remove_special_characters(new_text)
    new_text = remove_extra_whitespace_tabs(new_text)
    
    new_text_list.append(new_text)
    
df['tweet'] = pd.DataFrame(new_text_list)


In [61]:
print_tweet(df,len(df))

0
. Real Estate Market would crash if there is no demand for commercial space. Hybrid work Remote work works . If we design for it. For decades, Office space worked as space to socialize with fellow human beings. 

1
Concur. My company was "you MUST work in the office" and now they have said that is gone. More importantly many of our leaders have moved remote and we have hired remotely. That is a genie that is REALLY hard to put back in the bottle. 

2
Why not ask if we really need that thing? I think it would be fair that anyone that could work remote was privileged to do so, considering how many essential workers and small businesses got fucked over. Coming to to the office now feels like a teacher telling me to learn cursive 

3
Dear Line Managers, Appraisal your subordinate based on their Job performance and not sentiment ,blood line , religious group or tribe. #HR #Career #peformanceappraisal #EmployeeExperience #remotework #employees 

4
I've had more opportunities to work cross 

## 3.6 Expanding Contractions

In [62]:
import contractions
contractions.fix("If a company cared about you they'd ask you to three-days per week")


'If a company cared about you they would ask you to three-days per week'

In [63]:
df.isnull().value_counts()
df.dropna(inplace = True)
df.reset_index(drop = True)

Unnamed: 0,date,tweet,username,Language
0,2021-05-13 06:58:35,. Real Estate Market would crash if there is n...,rucsb,en
1,2021-05-13 06:55:20,"Concur. My company was ""you MUST work in the o...",thirdnline,en
2,2021-05-13 06:55:01,Why not ask if we really need that thing? I th...,toofarnorth49,en
3,2021-05-13 06:54:11,"Dear Line Managers, Appraisal your subordinate...",uzomabenny,en
4,2021-05-13 06:52:32,I've had more opportunities to work cross func...,juliethrelkeld,en
...,...,...,...,...
6910,2021-05-10 00:35:43,A majority of people want full time remote wor...,JDATSG,en
6911,2021-05-10 00:35:35,Tech makes more gig work possible while covid ...,ms_geezy,en
6912,2021-05-10 00:30:32,Catherine Merrill should lose her job. The abs...,chris_herd,en
6913,2021-05-10 00:30:20,Before we even get to prediction the OBSERVATI...,AdventuresOTM,en


In [64]:
text_list = []
for i in range(len(df)):
    text = df.iloc[i].tweet
    new_text = contractions.fix(df.iloc[i].tweet)
    new_text = new_text.strip()
    
    text_list.append(new_text)
    
df['tweet'] = pd.DataFrame(text_list)


# 4 - Lemmatizing<a id='4_Lemmatizing'></a>

In [65]:
import spacy
ori_doc = "we are eating and swimming; we have been eating and swimming; he eats and swims ; he ate and swam "
nlp = spacy.load('en_core_web_sm')

def get_lemmatized(the_text):
    the_text = nlp(the_text)
    the_text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in the_text])
    return the_text


get_lemmatized(ori_doc)

'we be eat and swimming ; we have be eat and swim ; he eat and swim ; he eat and swam'

In [66]:
df.isnull().value_counts()
df.dropna(inplace = True)
df.reset_index(drop = True)

Unnamed: 0,date,tweet,username,Language
0,2021-05-13 06:58:35,. Real Estate Market would crash if there is n...,rucsb,en
1,2021-05-13 06:55:20,"Concur. My company was ""you MUST work in the o...",thirdnline,en
2,2021-05-13 06:55:01,Why not ask if we really need that thing? I th...,toofarnorth49,en
3,2021-05-13 06:54:11,"Dear Line Managers, Appraisal your subordinate...",uzomabenny,en
4,2021-05-13 06:52:32,I have had more opportunities to work cross fu...,juliethrelkeld,en
...,...,...,...,...
6908,2021-05-10 00:38:03,A majority of people want full time remote wor...,WarMad58,en
6909,2021-05-10 00:38:01,Tech makes more gig work possible while covid ...,IMPACT360_BE,en
6910,2021-05-10 00:35:43,Catherine Merrill should lose her job. The abs...,JDATSG,en
6911,2021-05-10 00:35:35,Before we even get to prediction the OBSERVATI...,ms_geezy,en


In [67]:
text_list = []
for i in range(len(df)):
    ori_text = df.iloc[i].tweet
    new_text = get_lemmatized(ori_text)
    
    text_list.append(new_text)

df['tweet'] = pd.DataFrame(text_list)    
    

In [68]:
df.to_csv('tweets.csv',index = False)


In [69]:
df = pd.read_csv('tweets.csv')


In [70]:
df.isnull().value_counts()
df.dropna(inplace = True)
df.reset_index(drop = True)

Unnamed: 0,date,tweet,username,Language
0,2021-05-13 06:58:35,. real Estate Market would crash if there be n...,rucsb,en
1,2021-05-13 06:55:20,"Concur . my company be "" you must work in the ...",thirdnline,en
2,2021-05-13 06:55:01,why not ask if we really need that thing ? I t...,toofarnorth49,en
3,2021-05-13 06:54:11,"dear Line Managers , Appraisal your subordinat...",uzomabenny,en
4,2021-05-13 06:52:32,I have have more opportunity to work cross fun...,juliethrelkeld,en
...,...,...,...,...
6906,2021-05-10 00:39:46,a majority of people want full time remote wor...,GetMustered,en
6907,2021-05-10 00:38:44,tech make more gig work possible while covid s...,phillrow,en
6908,2021-05-10 00:38:03,Catherine Merrill should lose her job . the ab...,WarMad58,en
6909,2021-05-10 00:38:01,before we even get to prediction the observati...,IMPACT360_BE,en


# 5 - Manual Cleaning

In [154]:
new_text_list = []

for i in range(len(df)):
    old_text = df.iloc[i].tweet
    new_text = remove_extra_whitespace_tabs(old_text)
    
    
    # 1. remove repeat chars
    new_text = new_text.replace("...", ".")
    new_text = new_text.replace("..", ".")
    new_text = new_text.replace(",.", ".")
    new_text = new_text.replace(",,", ",")
    
    # 2. remove sapce after hashtag
    new_text = new_text.replace("# ", "#")
    new_text = new_text.replace('" ', '')
    
    # 3. remove double quote
    new_text = new_text.replace('"', '')
    new_text = new_text.replace("'", "")
    
    # 4. remove K or k
    new_text = re.sub(r'\b(k|K)\b', "", new_text)
    new_text = re.sub(r'\bamp\b', "", new_text)
    new_text = re.sub(r'\bgt\b', "", new_text)
    new_text = re.sub(r'\bvia\b', "", new_text)
    
    # 5. remove spaces before special chars
    new_text = new_text.replace(" .", ".")
    new_text = new_text.replace(" ,", ",")
    new_text = new_text.replace(" '", "'")
    new_text = new_text.replace(" ?", "?")
    new_text = new_text.replace(" !", "!")
    new_text = new_text.replace('  ', ' ')
    
    new_text = re.sub(' +', ' ', new_text)
    # 7. remove non-alphanumeric character at the beg. of a sentence
    new_text = re.sub('(^\W\s)', "", new_text) 
    
    new_text_list.append(new_text)
    
df['tweet'] = pd.DataFrame(new_text_list)

In [155]:
df.isnull().value_counts()
df.dropna(inplace = True)
df.reset_index(drop = True)

Unnamed: 0,date,tweet,username,Language
0,2021-05-13 06:58:35,real Estate Market would crash if there be no ...,rucsb,en
1,2021-05-13 06:55:20,Concur. my company be you must work in the off...,thirdnline,en
2,2021-05-13 06:55:01,why not ask if we really need that thing? I th...,toofarnorth49,en
3,2021-05-13 06:54:11,"dear Line Managers, Appraisal your subordinate...",uzomabenny,en
4,2021-05-13 06:52:32,I have have more opportunity to work cross fun...,juliethrelkeld,en
...,...,...,...,...
6110,2021-05-10 00:40:03,I want to live a bi coastal life! #manifestati...,MPBorman,en
6111,2021-05-10 00:39:46,a majority of people want full time remote wor...,GetMustered,en
6112,2021-05-10 00:38:03,Catherine Merrill should lose her job. the abs...,WarMad58,en
6113,2021-05-10 00:38:01,before we even get to prediction the observati...,IMPACT360_BE,en


In [156]:
# text = 'how do you design consumer grade amp delikery of #hr K k service? join amp on to'

# text = re.sub(r'\b([a-z]{1,2})\b', " ", text)


# # 6. remove K or k
# text = re.sub(r'\b(k|K)\b', "", text)
# text = re.sub(r'\bamp\b', "", text)
# text = re.sub(r'\bgt\b', "", text)
# text = re.sub(r'\bvia\b', "", text)

# text = re.sub(' +', ' ', text)

# print(text)

# 6 - Remove duplicates<a id ='7_Remove_Duplicates'></a>

In [157]:
df_duplicates_removed = df.drop_duplicates(subset=['tweet'], keep='first')

In [158]:
print(df.shape)
print(df_duplicates_removed.shape)


(6115, 4)
(6099, 4)


In [159]:
df = df_duplicates_removed.reset_index(drop = True)

In [160]:
print_tweet(df, len(df))

0
real Estate Market would crash if there be no demand for commercial space. hybrid work remote work work. if we design for it. for decade, Office space work as space to socialize with fellow human being. 

1
Concur. my company be you must work in the office and now they have say that be go. more importantly many of our leader have move remote and we have hire remotely. that be a genie that be REALLY hard to put back in the bottle. 

2
why not ask if we really need that thing? I think it would be fair that anyone that could work remote be privileged to do so, consider how many essential worker and small business got fuck over. come to to the office now feel like a teacher tell I to learn cursive 

3
dear Line Managers, Appraisal your subordinate base on their Job performance and not sentiment, blood line, religious group or tribe. #hr #career #peformanceappraisal #employeeexperience #remotework #employee 

4
I have have more opportunity to work cross functionally and engage with compan

In [163]:
print_empty_tweet(df)
df.isnull().value_counts()

Empty DataFrame
Columns: [date, tweet, username, Language]
Index: []


date   tweet  username  Language
False  False  False     False       6099
dtype: int64

# 7 - Save Final Data

In [164]:
df.to_csv('tweets.csv', index = False)

# 8-Pipeline<a id = '9_Pipeline'></a>

<span style="background-color:Teal">Remove hashtag-only sentences: <br> 
- #wfh articles. #WorkFromHomeJobs #workfromhome #remotework #remoteworking
    
<span style="background-color:Teal">Since we don't want to just remove all hashtags, lets just remote hashtags that apear at the end of sentences since these won't affect the sentiment of the the text. Unlike hashtags that are in the middle of sentences

## 8.1 Remove ending hashtags (Add to pipeline 1) <a id= '9.1_Remove_Ending_Hashtags'></a>


In [165]:

def remove_end_hashtag(the_df, end_index):  
    'Remove hashtags at the end of every tweet(rows) in a dataframe'
    regex_hashtag = '(#[A-Za-z0-9]+)'
    
    # create list to be appended to our df
    new_text_list = []
    # get text in each row
    for index_row in range(end_index):

        text = the_df.iloc[index_row].tweet
        
        # split the text
        text_list = text.split()
        
        index_list = -1     
        
        # get word in each list from the back
        for word in reversed(text_list):
            #if its a hashtag, remove it
            if(re.match(regex_hashtag,word) != None):
                text_list.pop(index_list)
            else:
                break
        # join back
        new_text = ' '.join(text_list)
        new_text_list.append(new_text)
    
    # update the_df
    the_df['tweet'] = pd.DataFrame(new_text_list)
    
    #drop nan
    the_df.dropna(inplace = True)
    the_df.reset_index(drop = True)
    

In [166]:
# Test removing end hashtags
df_pipeline = df.copy()
remove_end_hashtag(df_pipeline, len(df_pipeline))

In [168]:
print_tweet(df_pipeline, 10)

0
real Estate Market would crash if there be no demand for commercial space. hybrid work remote work work. if we design for it. for decade, Office space work as space to socialize with fellow human being. 

1
Concur. my company be you must work in the office and now they have say that be go. more importantly many of our leader have move remote and we have hire remotely. that be a genie that be REALLY hard to put back in the bottle. 

2
why not ask if we really need that thing? I think it would be fair that anyone that could work remote be privileged to do so, consider how many essential worker and small business got fuck over. come to to the office now feel like a teacher tell I to learn cursive 

3
dear Line Managers, Appraisal your subordinate base on their Job performance and not sentiment, blood line, religious group or tribe. 

4
I have have more opportunity to work cross functionally and engage with company leadership in the past year than in the Before Times. Remote have flatten

<span style="background-color:Teal"> Our original df will not have its hashtag removed. We will only do this in the pipeline <span>

## 8.2 Split hashtag words (Add to pipeline 2)<a id= '8.2_Split_Hashtags_Words'></a>

<span style="background-color:Teal">We want to split hash tag words that are part of the tweet that consist of two words. E.g: <br> 
convert: <br>
- Some buy #SecondHome before #PrimaryResidence! <br>
to: <br>
- Some buy second home before primary residence!

In [169]:
from __future__ import division
from collections import Counter
import re, nltk
# nltk.download('brown')

WORDS = nltk.corpus.brown.words()
COUNTS = Counter(WORDS)

def pdist(counter):
    "Make a probability distribution, given evidence from a Counter."
    N = sum(counter.values())
    return lambda x: counter[x]/N

P = pdist(COUNTS)

def Pwords(words):
    "Probability of words, assuming each word is independent of others."
    return product(P(w) for w in words)

def product(nums):
    "Multiply the numbers together.  (Like `sum`, but with multiplication.)"
    result = 1
    for x in nums:
        result *= x
    return result

def splits(text, start=0, L=20):
    "Return a list of all (first, rest) pairs; start <= len(first) <= L."
    return [(text[:i], text[i:]) 
            for i in range(start, min(len(text), L)+1)]

def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text: 
        return []
    else:
        candidates = ([first] + segment(rest) 
                      for (first, rest) in splits(text, 1))
        return max(candidates, key=Pwords)

print(segment('primary'))



['primary']


In [170]:
# #create function that takes in 1 hashtag word, and convert to split word

def get_capital_letter_index(the_word):
    "get the first index of the capital letter"
    for i in range(1, len(the_word)):
        if((the_word[i]).isupper()):
            return i
    return -1

def isAllCapital(the_word):
    "Check if a word is all capital letters"
    for i in range(1, len(the_word)):
        if(not(the_word[i]).isupper() and(the_word[i].isalpha())):
            return False
    return True
    

def get_split_word(the_word):
    'function that takes in 1 hashtag word, and convert to split words'
    final_word_list = []
    loose_char_list = []
    
    # CASE A: if all is capital return as is
    if(isAllCapital(the_word)):

        return the_word
    
    # CASE B: capital is in the middle, then split before the capital
    index_capital = get_capital_letter_index(the_word)
    if(index_capital != -1):
        string1 = the_word[0:index_capital]
        string2 = the_word[index_capital: len(the_word)]
        final_word_list = [string1, string2]
        return ' '.join(final_word_list)
    
    # CASE C: the word are not split by capital letter
    else:
        # 1. segment the word
        final_word_list = segment(the_word)  

        # 2. now we want to make sure word less than 3 chars is merged to previous word
        index = 1
        end_index = len(final_word_list)
        while(index < end_index):
            # 3. if length of current word is less than 2
            if( len(final_word_list[index]) <= 2):
                
                # 4. join current word to previous word
                final_word_list[index-1] = ''.join(final_word_list[(index-1):(index+1)])
                
                # 5. delete word at current
                final_word_list.pop(index)
                
                # 6. update end index after pop
                end_index = len(final_word_list)
            else:
                index += 1

    return ' '.join(final_word_list)   

print(get_split_word('remotework'))

                        

remote work


In [171]:
# my_list = ['start','up', 'growth']
# merge_tuples = [(1, 3), (5, 7)]

# index = 1
# end_index = len(my_list)
# while(index < end_index):
#     print('\nindex:', index)
#     print('word:', my_list[index])
#     if( len(my_list[index]) <= 3):
#         print('len of word is smaller than 3:')
#         my_list[index-1] = ''.join(my_list[(index-1):(index+1)])
        
#         my_list.pop(index)
#         print('my_list:',my_list)
#         end_index = len(my_list)
#         print('end index', end_index)
#     else:
#         index += 1

# print(' '.join(my_list))



In [176]:
def split_hashtag(the_df):
    'Split the hashtags in a given data frame'
    regex_hashtag = '#(\S*)'
    final_tweet_list = []
    
    # loop through each row and add modfied text to our final_tweet_list
    for i in range(len(the_df)):
        print(i)
        hashtag_word_dict = {}
        # 1. get list of merged words in a tweet: hashtag_word_list
        text = the_df.iloc[i].tweet
        hashtag_words_list = re.findall(regex_hashtag, text)      
    
        # 2. if there is merged words
        if(len(hashtag_words_list)!=0):  
            print('################')
#             print(hashtag_words_list)
            # for each word
            for i in range(len(hashtag_words_list)):
           
                # 3. get the word in each list, start from char 1 not 0. 0 is a symbol assign as key
                key_before_split = hashtag_words_list[i][0:len(hashtag_words_list[i])]
    
                # 4. split this word, assign as value
                value_after_split = get_split_word(key_before_split)
#                 print(key_before_split, ":  ", value_after_split)
            
                # 5. create dict 
                hashtag_word_dict[key_before_split] = value_after_split
            
    
        # 6. remove hashtag symbol from the text
        text = text.replace('#', '')
        text_list = text.split(' ')
    
        # 7. loop through the text. if it finds the word in our key, replace with our value
        for i in range(len(text_list)):
            for key, value in hashtag_word_dict.items():
                if(text_list[i] == key):
                    text_list[i] = value
    
        new_text = ' '.join(text_list)
        final_tweet_list.append(new_text)
#         print(new_text)

    # update df
    the_df['tweet'] = pd.DataFrame(final_tweet_list)
    
    #drop nan
    the_df.dropna(inplace = True)
    the_df.reset_index(drop = True)


In [177]:
# Test split hashtag after removing ending hashtag

split_hashtag(df_pipeline)

0
1
2
3
4
5
6
7
################
8
9
10
11
################
12
13
################
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
################
30
31
################
32
33
34
35
36
################
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
################
58
59
60
61
62
63
################
64
65
################
66
67
68
################
69
################
70
71
72
73
74
75
76
77
78
79
80
################
81
82
83
84
85
86
################
87
88
89
90
91
92
93
################
94
95
################
96
97
98
################
99
################
100
101
102
################
103
104
################
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
################
130
################
131
132
133
################
134
135
136
137
138
139
140
141
142
################
143
144
145
146
147
148
################
149
150
151
152
153
154
################
155
156
157
158
159
160
161
162
163
164
################
165
166


In [178]:
print_tweet(df_pipeline,len(df_pipeline))

0
real Estate Market would crash if there be no demand for commercial space. hybrid work remote work work. if we design for it. for decade, Office space work as space to socialize with fellow human being. 

1
Concur. my company be you must work in the office and now they have say that be go. more importantly many of our leader have move remote and we have hire remotely. that be a genie that be REALLY hard to put back in the bottle. 

2
why not ask if we really need that thing? I think it would be fair that anyone that could work remote be privileged to do so, consider how many essential worker and small business got fuck over. come to to the office now feel like a teacher tell I to learn cursive 

3
dear Line Managers, Appraisal your subordinate base on their Job performance and not sentiment, blood line, religious group or tribe. 

4
I have have more opportunity to work cross functionally and engage with company leadership in the past year than in the Before Times. Remote have flatten

## 8.3 - Remove Stopwords

In [179]:
import nltk.corpus
from nltk.corpus import stopwords
# nltk.download('stopwords')


def remove_stopwords(the_df):
    stop = stopwords.words('english')
    stop.remove('not')
    stop.extend(["click"])

    the_df.tweet = the_df.tweet.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))


In [181]:
stop = stopwords.words('english')
stop.remove('not')
stop.extend(["click"])

text = 'asds amp gt click here'
text = ' '.join([word for word in text.split() if word not in (stop)])
text

'asds amp gt'

## 8.4 Remove senteces less than 4 words <a id='9.3_Remove_Less_Than_4'></a>

In [None]:
def remove_short_tweets(the_df):
    
    indices_to_drop = []
    for i in range(len(df)):
        # 1. split text
        word_list = df.iloc[i].tweet.split()
        print(word_list)
        
        # 2. check len
        if(len(word_list) <= 4):
            indices_to_drop.append(i)
    
    the_df.drop(indices_to_drop, axis = 0)
    the_df.reset_index(drop = True)
        

8