## Dnyanai Surkutwar <br><ol><li>[Part I: Creating Dataset](#part1)</li><li>[Part II: Data Preprocessing](#part2)</li><li>[Part III: Extracting data from PDFs or web scraping the site's content](#part3)</li><li>[Part IV: Cleaning the extracted text](#part4)</li></ol>

## Importing libraries

In [None]:
import pandas as pd 
import tweepy as tw 

## For Web Scraping:
from bs4 import BeautifulSoup
import numpy as np 
from time import sleep 
from random import randint
from selenium import webdriver
import requests

# Table of Contents:

1. [Extracting Tweets from Twitter user - @SalesforceNews](#tweepy) <br>
   * [Setting up twitter app authentication](#tweepy)
   * [Creating API object](#auth)
   * [Collecting tweets](#collect)
   * [Extracting relevant information from the tweets](#extract)
   * [Creating Dataframe of extracted tweets](#df)
   * [Extracting labels and articles](#labels) <br>
   * [GetLabels and GetArticles Functions](#fns)<br>
   * [Pickling dataset](#pkl)<br>

<a id = 'tweepy'></a>
# Extracting tweets using tweepy

In [None]:
#Authentication for twitter:
consumer_key= 'aaa'
consumer_secret= 'dd'
access_token= 'dd'
token_secret= 'dd'
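
Keys typed directly into a notebook are easy to leak when the file is shared. A minimal sketch of one alternative, assuming the four values have been exported as environment variables. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the `load_twitter_credentials` helper are hypothetical and are not part of tweepy:

```python
import os

# Hypothetical environment variable names; use whatever fits your setup.
CREDENTIAL_VARS = {
    'consumer_key': 'TWITTER_CONSUMER_KEY',
    'consumer_secret': 'TWITTER_CONSUMER_SECRET',
    'access_token': 'TWITTER_ACCESS_TOKEN',
    'token_secret': 'TWITTER_TOKEN_SECRET',
}


def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, failing early if any is missing."""
    env = os.environ if env is None else env
    missing = [var for var in CREDENTIAL_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {name: env[var] for name, var in CREDENTIAL_VARS.items()}
```

The returned dict uses the same names as the variables above, so `creds['consumer_key']` could be passed wherever `consumer_key` is used.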

<a id = 'auth'></a>
## Creating API object:

In [None]:
auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, token_secret)

api = tw.API(auth, wait_on_rate_limit=True)

In [None]:
screen_name = 'SalesforceNews'

<a id='collect'></a>
## Collecting tweets

In [None]:
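# Search English tweets mentioning SalesforceNews, excluding retweets.
# Note: api.search is the tweepy 3.x name; tweepy 4.x renamed it to api.search_tweets.
output = [status._json for status in tw.Cursor(api.search, q='SalesforceNews -filter:retweets',
                                                count=100, tweet_mode='extended', include_entities=True, lang='en').items()]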

In [None]:
#import pickle
#filename = 'tweets_100_Dec_3.pkl'
#out = open(filename,'wb')
#pickle.dump(output,out)
#out.close()

In [None]:
output[2]['entities']['urls']

[]

<a id='extract'></a>
## Extracting the relevant information from the tweets

In [None]:
full_text = []   ## saving the tweet text in a list
post_url = []    ## saving the urls in a list 
tw_id = []       ## saving the tweet id in a list

for each in output:
    tw_id.append(each['id_str'])
    full_text.append(each['full_text'])
    post_url.append(each['entities']['urls'])

In [None]:
#post_url[0][0]['display_url'] ## testing different keys in the urls 

In [None]:
#post_url[0][0]['expanded_url']  ## We need the expanded url
urls = []   ## creating new list to save expanded urls only
disp_urls = [] ## we might get more information from display urls too, so creating a new list for that

for i in range(len(post_url)):
    #print(post_url[i])
    if post_url[i] != []:
        urls.append(post_url[i][0]['expanded_url'])
        disp_urls.append(post_url[i][0]['display_url'])
    else:
        urls.append('')
        disp_urls.append('')
disp_urls[:5]

['',
 'salesforce.com/news/stories/s…',
 '',
 'twitter.com/SalesforceNews…',
 'sforce.co/3oH7mIA']
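
The extraction above can also be wrapped in a small helper. This is a sketch on hand-written sample data: `first_url_field` is a hypothetical name, and the two sample dicts only mimic the `entities` / `urls` layout of the collected tweets.

```python
def first_url_field(tweet, field):
    """Return `field` from the tweet's first URL entity, or '' when it has none."""
    url_entities = tweet.get('entities', {}).get('urls', [])
    if not url_entities:
        return ''
    return url_entities[0].get(field, '')


# Hand-written samples that mimic the structure of the tweets in `output`.
sample_tweets = [
    {'entities': {'urls': []}},
    {'entities': {'urls': [{'expanded_url': 'https://sforce.co/3oH7mIA',
                            'display_url': 'sforce.co/3oH7mIA'}]}},
]

sample_urls = [first_url_field(t, 'expanded_url') for t in sample_tweets]
sample_disp_urls = [first_url_field(t, 'display_url') for t in sample_tweets]
print(sample_urls)
print(sample_disp_urls)
```

Using `.get` with defaults means a tweet without an `entities` key yields an empty string instead of raising a `KeyError`.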

In [None]:
full_text[2]

"As #COVID19 continues to disrupt the food system, we're working together to strengthen our food system’s connective tissue \n@OUSDNews @SalesforceNews @eatlearnplay\n@WCKitchen @numifoundationt @FullHarvestTech @MandelaPartners @UberFreight"

<a id='df'></a>
## Creating a Dataset dataframe from the collected tweets: 

In [None]:
data = {'tweet_id': tw_id, 'full_text':full_text, 'url':urls, 'disp_url':disp_urls}

df_salesforce = pd.DataFrame(data=data)

In [None]:
df_salesforce.head()

| | tweet_id | full_text | url | disp_url |
|---|---|---|---|---|
| 0 | 1337280322011078656 | @SalesforceNews Hi I have a question Salesforc... | | |
| 1 | 1337265529875271682 | Salesforce’s Wade Wegner on the Growth of Mode... | https://www.salesforce.com/news/stories/salesf... | salesforce.com/news/stories/s… |
| 2 | 1337189264296054784 | As #COVID19 continues to disrupt the food syst... | | |
| 3 | 1337119482074943489 | Got to talk to some really smart people doing ... | https://twitter.com/SalesforceNews/status/1337... | twitter.com/SalesforceNews… |
| 4 | 1337092473877884929 | .@CDCFound is at the forefront of COVID-19 rel... | https://sforce.co/3oH7mIA | sforce.co/3oH7mIA |


## Creating labels based on where the url redirects: `stories` -> Case Studies, `press-releases` -> Press Releases. This rule is specific to Salesforce.

## Some of the urls are shortened or otherwise indirect, so we will use the requests package to fetch them and inspect the resolved address, to check whether they can be categorized as 'case studies' or 'press releases':

In [None]:
'stories' in 'https://www.salesforce.com/news/stories/the-bi'

True
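
The labeling rule can be sketched as a small pure function. `label_from_url` and `SEGMENT_LABELS` are hypothetical names used only for illustration. Unlike the plain substring test above, this version compares whole path segments, which is an assumption about how strict the match should be:

```python
from urllib.parse import urlparse

# Path segment -> label, following the rule described above.
SEGMENT_LABELS = {
    'stories': 'case studies',
    'press-releases': 'press releases',
}


def label_from_url(url):
    """Return the label implied by the URL path, or '' if no rule matches."""
    segments = urlparse(url).path.strip('/').split('/')
    for segment, label in SEGMENT_LABELS.items():
        if segment in segments:
            return label
    return ''


print(label_from_url('https://www.salesforce.com/news/stories/the-bi'))  # case studies
print(repr(label_from_url('https://sforce.co/3oH7mIA')))  # '' (short link, resolve it first)
```

Matching on segments avoids labeling a page just because the word "stories" happens to appear inside an article slug. A short link such as `sforce.co/...` returns an empty label until it has been resolved to its final address.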

In [None]:
df_salesforce.head()

| | tweet_id | full_text | url | disp_url |
|---|---|---|---|---|
| 0 | 1337280322011078656 | @SalesforceNews Hi I have a question Salesforc... | | |
| 1 | 1337265529875271682 | Salesforce’s Wade Wegner on the Growth of Mode... | https://www.salesforce.com/news/stories/salesf... | salesforce.com/news/stories/s… |
| 2 | 1337189264296054784 | As #COVID19 continues to disrupt the food syst... | | |
| 3 | 1337119482074943489 | Got to talk to some really smart people doing ... | https://twitter.com/SalesforceNews/status/1337... | twitter.com/SalesforceNews… |
| 4 | 1337092473877884929 | .@CDCFound is at the forefront of COVID-19 rel... | https://sforce.co/3oH7mIA | sforce.co/3oH7mIA |


In [None]:
df_salesforce['labels'] = ''

In [None]:
df_salesforce['articles'] = ''

<a id='labels'></a>
## Extracting Labels and Articles 

In [None]:
## Selenium setup - a powerful tool, but the initial thought was that requests would be enough, so this is left commented out ##
## Installing Selenium:
#!pip install selenium

## Getting the chromedriver to work for selenium: 
#!apt-get update 
#!apt install chromium-chromedriver
#!which chromedriver

#!cp /usr/lib/chromium-browser/chromedriver /usr/bin
#import sys
#sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

#from selenium import webdriver 

#chrome_options = webdriver.ChromeOptions()
  #chrome_options.add_argument('--headless')
  #chrome_options.add_argument('--no-sandbox')
  #chrome_options.add_argument('--disable-dev-shm-usage')
  
  
  #driver = webdriver.Chrome('chromedriver',options=chrome_options)
  #article = BeautifulSoup(driver.page_source,'html.parser') # get the article from the url  

## Performing the following web scraping steps to extract labels as well as articles:
1. Using requests to get the html page for each url.
2. Using BeautifulSoup to parse (scrape) the page.
3. Storing the label in the labels column.
4. Storing the article text found at the url in the articles column, if it exists.
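
Step 3 depends on reading the page's canonical URL, which is what `getLabels` inspects. The notebook does this with BeautifulSoup; the sketch below swaps in the standard library's `html.parser` purely so that it runs without extra packages. `CanonicalLinkParser`, `canonical_urls`, and the sample HTML (including its URL) are made up for illustration:

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Collect the href of every <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'link':
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated values, so split before comparing
        rel_values = (attr_map.get('rel') or '').lower().split()
        if 'canonical' in rel_values and attr_map.get('href'):
            self.canonical_urls.append(attr_map['href'])


def canonical_urls(html):
    """Return all canonical hrefs found in an HTML string."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical_urls


# Made-up page used only to exercise the parser.
sample_html = (
    '<html><head>'
    '<link rel="stylesheet" href="/style.css">'
    '<link rel="canonical" href="https://www.salesforce.com/news/stories/example/">'
    '</head><body><section>Article text</section></body></html>'
)
print(canonical_urls(sample_html))
```

A page without a canonical link yields an empty list, which is the case a caller needs to check before indexing into the result.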

In [None]:
df_salesforce.url[:2]

0                                                     
1    https://www.salesforce.com/news/stories/salesf...
Name: url, dtype: object

In [None]:
df_salesforce['labels'].loc[df_salesforce.url=='https://sforce.co/3oH7mIA'].values[0]

'case studies'

In [None]:
## Installing Selenium:
!pip install selenium

## Getting the chromedriver to work for selenium: 
!apt-get update 
!apt install chromium-chromedriver
!which chromedriver

<a id='fns'></a>
## Functions to get the article and the label for a particular url:

In [None]:
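## Get Articles function: fetch the page at `url` and store its text in the matching row(s)
def getArticles(url):
    if url == '':
        return

    res = requests.get(url)
    soup = BeautifulSoup(res.content, "html.parser")

    # Not every page has a <section> tag; skip those instead of raising AttributeError
    if soup.section is None:
        return

    # Assign through a single .loc call to avoid chained-assignment problems.
    # Only rows whose stored url equals `url` are updated, so rows that hold a
    # shortened link (e.g. sforce.co) keep an empty article.
    df_salesforce.loc[df_salesforce.url == url, 'articles'] = soup.section.text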

In [162]:
## Get Labels function: label the row(s) for `link` based on the page's canonical url
def getLabels(link, article):
    # Some tweet urls are shortened or incomplete, so we read the page's
    # canonical url from its <link rel="canonical"> tag instead
    check_url = [item.get('href') for item in article.find_all("link", attrs={"rel": "canonical"})]

    # Nothing to label if the page has no canonical link or its href is missing
    if len(check_url) == 0 or check_url[0] is None:
        return

    mask = df_salesforce.url == link

    if 'stories' in check_url[0]:
        if df_salesforce.loc[mask, 'labels'].values[0] == '':
            df_salesforce.loc[mask, 'labels'] = 'case studies'
            getArticles(check_url[0])
        print('stories')

    if 'press-releases' in check_url[0]:
        if df_salesforce.loc[mask, 'labels'].values[0] == '':
            df_salesforce.loc[mask, 'labels'] = 'press releases'
        getArticles(check_url[0])
        print('pr')

## Main function to do the task of extracting the labels and the articles at the respective urls:

In [179]:
def main():
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')

    driver = webdriver.Chrome('chromedriver', options=chrome_options)

    try:
        for url in df_salesforce.url:
            if url == '':
                continue  # tweet has no link, nothing to scrape

            driver.get(url)
            sleep(randint(0, 3))  # pause 0 to 3 seconds between page loads

            article = BeautifulSoup(driver.page_source, 'html.parser')
            getLabels(url, article)
    finally:
        driver.quit()  # always shut the browser down, even if a page fails


if __name__ == '__main__':
    main()

['https://www.salesforce.com/news/stories/salesforces-wade-wegner-on-the-growth-of-modern-app-development-in-2021/']
https://www.salesforce.com/news/stories/salesforces-wade-wegner-on-the-growth-of-modern-app-development-in-2021/
stories
['https://twitter.com/salesforcenews/status/1337072131260084225', 'https://twitter.com/SalesforceNews/status/1337072131260084225']
['https://www.salesforce.com/news/stories/cdc-foundation-fundraising-and-initiatives/']
https://www.salesforce.com/news/stories/cdc-foundation-fundraising-and-initiatives/
stories
['https://www.salesforce.com/news/stories/salesforce-and-eat-learn-play-announce-new-partners-funding-and-scope-of-pilot-program-to-address-food-insecurity-in-the-bay-area/']
https://www.salesforce.com/news/stories/salesforce-and-eat-learn-play-announce-new-partners-funding-and-scope-of-pilot-program-to-address-food-insecurity-in-the-bay-area/
stories
['https://www.salesforce.com/news/stories/salesforce-ai-breast-cancer/']
https://www.salesforce.c

In [180]:
df_salesforce.head()

| | tweet_id | full_text | url | disp_url | labels | articles |
|---|---|---|---|---|---|---|
| 0 | 1337280322011078656 | @SalesforceNews Hi I have a question Salesforc... | | | | |
| 1 | 1337265529875271682 | Salesforce’s Wade Wegner on the Growth of Mode... | https://www.salesforce.com/news/stories/salesf... | salesforce.com/news/stories/s… | case studies | \n\nCompanies across industries and regions ar... |
| 2 | 1337189264296054784 | As #COVID19 continues to disrupt the food syst... | | | | |
| 3 | 1337119482074943489 | Got to talk to some really smart people doing ... | https://twitter.com/SalesforceNews/status/1337... | twitter.com/SalesforceNews… | | |
| 4 | 1337092473877884929 | .@CDCFound is at the forefront of COVID-19 rel... | https://sforce.co/3oH7mIA | sforce.co/3oH7mIA | case studies | |


In [181]:
#pd.set_option('max.rows',None)

#df_salesforce.to_csv('dataset.csv')

In [182]:
#df_salesforce.to_csv('tweets_Salesforce_articles.csv')

## Checking how many rows have an empty label:

In [183]:
len(df_salesforce[df_salesforce.labels==''])

32

In [184]:
len(df_salesforce)

121

In [196]:
print('Overall we have',round(((121-32)/121)*100,2),'% of labels which are divided into case studies and press releases')

Overall we have 73.55 % of labels which are divided into case studies and press releases


In [185]:
len(df_salesforce.articles)

121

In [209]:
121-32

89

In [211]:
len(df_salesforce[df_salesforce.articles==''])-32

17

In [215]:
print('Overall we have',round(((89-17)/89)*100,2),'% of articles which are divided into case studies and press releases')

Overall we have 80.9 % of articles which are divided into case studies and press releases
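
The two percentages above are computed from counts typed in by hand (121, 32, 89, 17). A minimal sketch of deriving them from the data instead; `percent_filled` is a hypothetical helper, and the sample lists are stand-ins built from the counts reported above rather than the real columns:

```python
def percent_filled(values):
    """Percentage of entries that are not the empty string, rounded to 2 places."""
    values = list(values)
    if not values:
        return 0.0
    filled = sum(1 for value in values if value != '')
    return round(filled / len(values) * 100, 2)


# Stand-in for df_salesforce.labels: 121 rows, 32 of them unlabeled.
sample_labels = ['case studies'] * 89 + [''] * 32
print(percent_filled(sample_labels))  # 73.55

# Stand-in for the articles of the 89 labeled rows, 17 of them empty.
sample_articles = ['some text'] * 72 + [''] * 17
print(percent_filled(sample_articles))
```

On the real dataframe, the equivalent calls would be `percent_filled(df_salesforce.labels)` and `percent_filled(df_salesforce.articles[df_salesforce.labels != ''])`, so the figures stay correct if the tweets are collected again.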


<a id='clean'></a>
## Cleaning articles in the dataframe: 

In [197]:
df_salesforce.articles[1]

'\n\nCompanies across industries and regions are looking to accelerate their digital transformations in the face of the pandemic. This makes the software developer’s role all the more crucial to the success of the business, as developers will largely be responsible for building the apps and processes that help their organizations bridge the divide, remain connected to their customers, and stay relevant in this digital-first world.\nThis only works if they have the right platform to be successful though, and that includes everything from pro-code tools for building B2C scale consumer apps, to low-code declarative tools for quickly building business processes.\xa0\nToday’s developers expect languages, tools, and seamless deployment options that span the entire range of skills, from low-code app builder to full stack developer. And, with demand for developers slated to grow 22% annually between now and 2029, this growing need will make it even more important for companies to provide the r

In [198]:
df_salesforce.articles = df_salesforce.articles.str.replace('\n','')
df_salesforce.articles = df_salesforce.articles.str.replace('\t','')
df_salesforce.articles = df_salesforce.articles.str.replace('\xa0','')

In [200]:
df_salesforce.articles[1]

'Companies across industries and regions are looking to accelerate their digital transformations in the face of the pandemic. This makes the software developer’s role all the more crucial to the success of the business, as developers will largely be responsible for building the apps and processes that help their organizations bridge the divide, remain connected to their customers, and stay relevant in this digital-first world.This only works if they have the right platform to be successful though, and that includes everything from pro-code tools for building B2C scale consumer apps, to low-code declarative tools for quickly building business processes.Today’s developers expect languages, tools, and seamless deployment options that span the entire range of skills, from low-code app builder to full stack developer. And, with demand for developers slated to grow 22% annually between now and 2029, this growing need will make it even more important for companies to provide the right tools t
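
Because the newlines are removed rather than replaced, sentences that were on separate lines now run together (for example "world.This only works" in the output above). A hedged alternative is to turn every run of whitespace into a single space. `clean_article` is a hypothetical helper and the sample string is invented:

```python
import re


def clean_article(text):
    """Collapse newlines, tabs, and non-breaking spaces into single spaces."""
    text = text.replace('\xa0', ' ')
    return re.sub(r'\s+', ' ', text).strip()


raw = '\n\nFirst sentence.\nSecond sentence.\xa0\nThird\tsentence.'
print(clean_article(raw))  # First sentence. Second sentence. Third sentence.
```

Applied to the dataframe, this would be `df_salesforce.articles.apply(clean_article)`, which keeps word and sentence boundaries intact for any later tokenization.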

In [207]:
df_salesforce.url[4]

'https://sforce.co/3oH7mIA'

In [208]:
df_salesforce.articles[4]

''

<a id='pkl'></a>
## Pickling dataframe to begin the work of the classifier in a new file:

In [202]:
import pickle

filename = 'Entire_Dataset.pkl'

# The with-block closes the file even if pickling fails
with open(filename, 'wb') as out:
    pickle.dump(df_salesforce, out)
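
In the new file, the dataset has to be read back with `pickle.load`. A minimal round-trip sketch, assuming a small hand-written list of records in place of `df_salesforce` and a temporary directory in place of the working directory:

```python
import os
import pickle
import tempfile

# Hand-written stand-in for the dataframe being saved.
records = [{'tweet_id': '1337092473877884929', 'labels': 'case studies'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'Entire_Dataset.pkl')

    # Write the object in binary mode ...
    with open(path, 'wb') as out:
        pickle.dump(records, out)

    # ... and read it back the same way in the consuming file.
    with open(path, 'rb') as src:
        restored = pickle.load(src)

print(restored == records)  # True
```

For a dataframe, pandas also provides `df_salesforce.to_pickle(filename)` and `pd.read_pickle(filename)`, which wrap the same steps.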