In [4]:
import pandas as pd

df_tweets=pd.read_csv("./data/all_tweets.csv", sep=";")

### Objective: Identify fundraising rounds

### Format: Data classified in tabular form (CSV-like)

### Criteria: Company name, Date, Amount, Round type (e.g. Series A, B, ...), Investor names

# Data Quality Check

In [187]:
df_tweets.isnull().sum()

date       0
user       0
tweet    103
dtype: int64

Observation : We should drop every tweet that has no text. Our goal is to extract information from text data, so a row without text adds nothing to the process.

In [188]:
df_tweets = df_tweets.dropna(axis = 0, how ='any') 

# Crude Data Exploration

In [11]:
df_tweets.user.unique()

array(['techcrunch', 'crunchbase', 'lesechos', 'finextra', 'davidbrear',
       'eileentso', 'obussmann', 'cgledhill', 'gcgodfrey', 'jimmarous',
       'spirosmargaris', 'chris_skinner', 'softbank', 'wearebankable',
       'annairrera', 'fintechch', 'devie_mohan', 'boursorama',
       'revolutapp', 'fintech_futures', 'invyo_insights', 'brettking',
       'sytaylor', 'leimer', 'elonmusk', 'duenablomstrom', 'fgraillot',
       'finmktg', 'sbmeunier', 'rshevlin', 'minh_q_tran', 'dgwbirch',
       'clagett', 'matteorizzi', 'lizlum', 'amittwitr', 'ftpartners',
       'sammaule', 'jpnicols', 'tek_fin', 'barkowconsult',
       'vitalikbuterin', 'andi_staub', 'kvanderhoydonk', 'ronald_vanloon',
       'ralexjimenez', 'wfsullivan3', 'mikequindazzi', 'simoncocking',
       'susannechishti', 'psb_dc', 'mikebutcher', 'fredwilson',
       'thomaspower', 'robfindlay', 'nigelwalsh', 'visible_banking',
       'ambajorat', 'sabinevdl', 'guzmand', 'frankjschwab',
       'therudingroup', 'thinkpayments',

Observation 1 : Different users structure their tweets differently. The mix of English-speaking and French-speaking accounts also means the tweets may come in more than one language. We should expect complications from these differences.

In [15]:
df_tweets.tweet[0]

'trendy luggage brand away packs on $100 m , rolls past $1.4 b valuation https ://tcrn.ch/2ljr6rj by @kateclarktweets pic .twitter.com/mxvdryqxj1'

Observation 2 : Some of the fields we want to extract are not in the tweet. Example : in the first tweet, shown above, the company name is absent. We expect to have to scrape the missing data from the links in the tweet. => URL extractor per tweet.

Observation 3 : The URL format doesn't seem reliable. "https ://tcrn.ch/2ljr6rj" is not a valid URL. This will require some thought on the best way to clean URLs after extraction and before scraping.

Observation 4 : In the same tweet, money can be mentioned multiple times. We can't trust a simple regex extractor. We have to dig deeper to make sure an amount is money actually raised and not a company valuation.

Observation 5 : Bitly URLs are case sensitive. All tweets were preprocessed and turned to lowercase. Essentially, any URL that is case sensitive is useless to us in this form.

Observation 6 : SpaCy's and NLTK's models are case sensitive. If we try to extract investor names and company names with NER models, lowercased data won't get us very far.
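The whitespace damage described in Observation 3 can be repaired with a small substitution step. The following is only a sketch: the function name `repair_urls` is hypothetical, and it assumes the only damage is a stray space after the scheme ("https ://") or after a leading "pic" ("pic .twitter.com/"), as seen in the sample tweet.

```python
import re

# Assumption: the damage is limited to whitespace inserted right after the
# scheme ("https ://...") or after "pic" ("pic .twitter.com/...").
_SCHEME_GAP = re.compile(r"\b(https?)\s+(://)")
_PIC_GAP = re.compile(r"\bpic\s+(\.twitter\.com/)")

def repair_urls(text):
    """Remove the spurious spaces that split URLs in the preprocessed tweets."""
    text = _SCHEME_GAP.sub(r"\1\2", text)
    return _PIC_GAP.sub(r"pic\1", text)

sample = "rolls past $1.4 b valuation https ://tcrn.ch/2ljr6rj pic .twitter.com/mxvdryqxj1"
print(repair_urls(sample))
```

This only restores the URL syntax. It cannot restore the original letter case, so the problem raised in Observation 5 remains.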

### Next steps

We fix Observation 5 by rescraping all tweets. The tool used here is GetOldTweets3, which allows the acquisition of old tweets. It will also fix Observation 3. We will then extract URLs and fix Observation 2.

In [139]:
from datetime import datetime, timedelta

def add_days(str_date, days=1, date_format="%Y-%m-%d"):
    date = datetime.strptime(str_date, date_format)
    modified_date = date + timedelta(days=days)
    return datetime.strftime(modified_date, date_format)

In [223]:
import GetOldTweets3 as got

def tweet_acquisition(row):
    username = row["user"]
    text_query = " ".join(row["tweet"].split()[0:6]) 
    str_date_since = row["date"][0:10]
    str_date_until = add_days(str_date_since)

    count=1

    tweetCriteria = (got.manager.TweetCriteria()
                                .setQuerySearch(text_query)
                                .setUsername(username)
                                .setSince(str_date_since)
                                .setUntil(str_date_until)
                                .setMaxTweets(count))
    # Fetch the tweets matching the criteria (at most `count`)
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    # Keep the text of the matched tweet, or "" when nothing was found
    text_tweets = tweets[-1].text if tweets else ""
    return text_tweets

In [162]:
# df_tweets=df_tweets.reset_index()

In [151]:
df_tweets["raw_tweets"]=df_tweets.apply(tweet_acquisition, axis=1)

[cell output: a row counter, one integer per line, from 0 to 276, followed by a truncated final line]

SystemExit: 

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


In [221]:
df_tweets_subset=df_tweets.head(300)

In [224]:
df_tweets_subset["raw_tweets"]=df_tweets_subset.apply(tweet_acquisition, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [163]:
df_tweets=df_tweets[[i for i in df_tweets if i!="index"]]

Observation 7 : The approach failed: after 590 requests, Twitter rejected further requests as too many. The work below still extracts URLs, but only as an academic exercise, knowing full well that we simply can't open them.
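One common way to soften this kind of failure is to retry with increasing pauses. The sketch below is an illustration only: `call_with_backoff` and its parameters are hypothetical names, and nothing here shows that backoff would have been enough to stay under Twitter's limits. It catches `SystemExit` explicitly because the interrupted run above ended with that exception, which a plain `except Exception` would not intercept.

```python
import time

def call_with_backoff(func, *args, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); on failure wait base_delay * 2**attempt, then retry.

    Returns the result of func, or "" once all retries are exhausted,
    mirroring the empty-string fallback used elsewhere in this notebook.
    """
    for attempt in range(retries):
        try:
            return func(*args)
        except (Exception, SystemExit):
            # Exponential backoff: 1x, 2x, 4x ... the base delay
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    return ""

# Demonstration with a stand-in function that fails twice, then succeeds.
delays = []
calls = {"n": 0}

def flaky(row):
    calls["n"] += 1
    if calls["n"] < 3:
        raise SystemExit
    return "tweet text for " + row

result = call_with_backoff(flaky, "row-0", sleep=delays.append)
print(result, delays)
```

In the notebook, a wrapper like this could be placed around `tweet_acquisition` inside the `apply` call, with a real `sleep` and a longer base delay.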

### Next steps : 


We use a subset of the data and extract URLs. Had the tweet data not been cleaned, we could have scraped the articles behind all URLs and acquired everything we need. As it stands, we are limited to the data actually present in our tweets dataset.

In [247]:
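import re

# Raw string so that backslash escapes reach the regex engine unchanged
str_url_regex=r'(?:http[s]?)?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
url_matcher = re.compile(str_url_regex)

def url_extractor(text, url_matcher=url_matcher):
    # Return the first URL found in text, or "" when there is none
    # (a non-string value such as NaN also yields "")
    try:
        return url_matcher.findall(text)[0]
    except (IndexError, TypeError):
        return ""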

In [248]:
df_tweets_subset=df_tweets_subset.copy()

In [249]:
df_tweets_subset["urls"]=df_tweets_subset.raw_tweets.apply(url_extractor)

### Conclusion

We have managed to extract URLs from our tweets. Next, we will try to extract company names, round types (Series A, B, ...), funds raised, investor names and dates.

#### Funds :

We'll check whether we can extract amounts written as a currency symbol followed by a number and, optionally, a unit (m, b, million, billion) and a currency name.
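To make the idea concrete on the sample tweet, here is a standard-library sketch. It is a simplification under stated assumptions: the name `MONEY` is hypothetical and only `$`, `€` and `£` are recognised, whereas the pattern used in this notebook relies on the third-party `regex` module and `\p{Sc}` to cover every currency symbol.

```python
import re

# Assumption: only $, € and £ are handled in this simplified sketch.
MONEY = re.compile(
    r"[$€£]\s?\d+(?:\.\d+)?"                 # symbol, then a number such as 100 or 1.4
    r"(?:\s?(?:million|billion|m|b)\b)?",     # optional unit, matched as a whole word
    re.IGNORECASE,
)

tweet = "trendy luggage brand away packs on $100 m , rolls past $1.4 b valuation"
print(MONEY.findall(tweet))
```

Both the amount raised and the valuation match, which is exactly the ambiguity raised in Observation 4: the pattern finds amounts but cannot tell which one is the fundraising.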

In [83]:
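import regex

# \p{Sc} matches any Unicode currency symbol; it requires the third-party `regex` module
str_money_regex=r'\p{Sc}\s?\d+(?:\.\d+)?\s?(?:million|billion|M|m|B|b|Million|Billion)?(?:\s(?:dollars|euros|pounds|USD|EUR|GBP))?'

def funds_extractor(text):
    # Return every money-like substring found in text ([] for non-string input)
    try:
        return regex.findall(str_money_regex, text)
    except TypeError:
        return []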

In [275]:
df_tweets_subset["funds"]=df_tweets_subset.raw_tweets.apply(funds_extractor)

In [279]:
sum(df_tweets_subset["funds"].str.len()==0)

268

Observation : 268 of the 300 tweets contain no detectable amount, so the tweets alone won't get us very far. Extraction from the linked articles seems like a must.

##### Scraping from URLs 

We'll use BeautifulSoup to handle tag extraction. The aim is to extract the title and body text of each target page. From there we can apply NER models and regular expressions to get as much information as we can.

In [290]:
from bs4 import BeautifulSoup

The first 300 tweets are all from TechCrunch. We will therefore specialize a little towards the particular issues met when scraping TechCrunch, keeping in mind that similar issues arise across other websites.

When using Selenium, TechCrunch first shows a privacy-policy page that has to be accepted by clicking a button.

In [22]:
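from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome('C:/Users/Thrall/chromedriver/chromedriver.exe')
driver.get("https://techcrunch.com/")
# Accept the privacy-policy page; find_element(By.NAME, ...) replaces the
# find_element_by_name helper, which recent Selenium releases no longer provide
driver.find_element(By.NAME, 'agree').click()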

In [None]:
#<time class="full-date-time"

In [132]:
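def extract_body(str_url):
    # Load the page in the Selenium driver and return its title, article
    # paragraphs and date tags; "" when there is no URL or the page fails to load
    if str_url=="":
        return ""
    try:
        driver.get(str_url)
    except Exception:
        return ""
    html = driver.page_source
    
    soup = BeautifulSoup(html, 'lxml')
    title = soup.title
    try:
        # Paragraphs of the first element carrying the "article-content" class
        body = soup.findAll(attrs={"class" : "article-content"})[0].findAll("p")
    except IndexError:
        body=""
    # findAll returns an empty list when no date tag is present
    date_str = soup.findAll(attrs={"class" : "full-date-time"})
        
    return {"body":body, "title":title, "date" : date_str}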

In [133]:
df_tweets_subset["bodies"]=df_tweets_subset.urls.apply(extract_body)

Observation : Project is bigger than anticipated. Sleep-checkpoint required.

In [440]:
df_tweets_subset.to_csv("./data/checkpoint.csv", sep=";", index=False)

### Next steps : 

Extract information from the bodies: funds, company names, series, and so on.

In [1]:
import pandas as pd

df_tweets_subset = pd.read_csv("./data/checkpoint.csv", sep=";")

In [135]:
a=df_tweets_subset.bodies[0]

In [143]:
from bs4 import BeautifulSoup

from functools import reduce
from operator import iconcat
import spacy

nlp = spacy.load("en_core_web_sm")

### raw_data => parsed_data

This regular expression will detect series
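One caveat is worth illustrating for a case-insensitive pattern of the form `seed|series [A-Z]|mezzanine|IPO|public`: it also matches the start of ordinary phrases such as "series of". The sketch below compares it with a stricter variant that uses word boundaries. The names `loose` and `strict` and the sample sentence are made up for this illustration, and the stricter pattern is a suggestion, not what the pipeline below runs.

```python
import re

# Loose pattern of the form used in this notebook
loose = re.compile(r"seed|series [A-Z]|mezzanine|IPO|public", re.IGNORECASE)
# Hypothetical stricter variant: same alternatives, matched as whole words only
strict = re.compile(r"\b(?:seed|series [A-Z]|mezzanine|IPO|public)\b", re.IGNORECASE)

text = "After a series of delays, the startup closed its Series B."
print(loose.findall(text))
print(strict.findall(text))
```

The word boundaries remove the "series o" kind of false positive. They do not help with "public", which still matches any use of the word, whether or not it refers to a listing.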

In [100]:
import re

str_series_regex=r'seed|series [A-Z]|mezzanine|IPO|public'

series_matcher = re.compile(str_series_regex, re.IGNORECASE)

def series_extractor(text, series_matcher=series_matcher):
    # Return the first round label found in text, or "" when there is none
    try:
        return series_matcher.findall(text)[0]
    except (IndexError, TypeError):
        return ""

In [180]:
metadata={"raw": {}, "parsed" : {}, "edited" : {}}

metadata["raw"]=a

We parse a paragraph looking for investors, using both TechCrunch tags and spaCy.

In [205]:
def parse_investors(p):
    
    investors=p.findAll(attrs={"class" : "crunchbase-link", "data-type":"organization" })

    spacy_labels = [ent.text.strip() for ent in nlp(p.text).ents if ent.label_ in ["ORG", "PERSON"]]

    parsed_investors=[]
    if investors:
        parsed_investors.extend([j.contents[0].split(",")[0].strip()  for j in investors])
    if spacy_labels:
        parsed_investors.extend(spacy_labels)
    if not parsed_investors:
        return False
    return parsed_investors

We parse funds and series using regular expressions.

In [210]:
def parse_funds(p):
    funds=[funds_extractor(text) for text in p.contents]
    funds=reduce(iconcat, funds, [])
    return funds

In [214]:
def parse_series(p):
    series = [series_extractor(text) for text in p.contents]
    series=[serie for serie in series if serie!=""]
    return series

We parse the title and date when possible.

In [400]:
def parse_datetitle(raw_metadata):
    try:
        date_temp=raw_metadata["date"][0]["datetime"]
    except:
        date_temp=""
    return {
        "date" : date_temp,
        "title" : raw_metadata["title"].text
    }

In [476]:
def parse_metadata(raw_metadata):
    parsed_metadata={}
    i=1
    if raw_metadata== "":
        return ""
    for p in raw_metadata["body"]:
        parsed_metadata[i]={}
        if i==1:
            try:
                parsed_metadata[i]["name_candidates"]=p.find("a").contents[0].text
            except:
                pass
            
        parsed_investors = parse_investors(p)
        if parsed_investors:
            parsed_metadata[i]["investors"]=parsed_investors
            
        funds=parse_funds(p)
        if funds:
            parsed_metadata[i]["funds"]=funds

        series = parse_series(p)
        if series:
            parsed_metadata[i]["series"]=series

        i=i+1
    parsed_metadata[0]=parse_datetitle(raw_metadata)
    
    return parsed_metadata

Test run as a sanity check

In [217]:
metadata["parsed"]=parse_metadata(metadata["raw"])

### parsed_data => edited_data

We extract the unit from a funds string.

In [242]:
str_unit_regex='million|billion|M|B'

unit_matcher = re.compile(str_unit_regex, re.IGNORECASE)

def unit_extractor(text, unit_matcher=unit_matcher):
    try:
        return unit_matcher.findall(text)[0]
    except:
        return ""

We extract the amount from the funds string.

In [271]:
str_amount_regex=r'[0-9]+(?:,|\.)?[0-9]*'

amount_matcher = re.compile(str_amount_regex, re.IGNORECASE)

def amount_extractor(text, amount_matcher=amount_matcher):
    try:
        return float(amount_matcher.findall(text)[0].replace(",", "."))
    except:
        return ""

This maps each unit to a multiplier so that amounts can be compared numerically.

In [302]:
units_converter={
    "billion" : 1000000000,
    "b" : 1000000000,
    "million" : 1000000,
    "m" : 1000000,
}

We'll use this to extract currencies.

In [303]:
str_currency_regex=r'\p{Sc}|(?:dollars|euros|pounds|USD|EUR|GBP)'

def currency_extractor(text):
    try:
        return regex.findall(str_currency_regex, text)[0]
    except:
        return ""

#### We'll use the following rules to determine the series :

* If there are multiple series in the text, pick the first one associated with funds.

* If there are multiple series in the paragraph, pick the most advanced one (example : mezzanine > series Z).

#### We'll use the following rules to determine which funds to pick :

* If there are multiple funds in the text, pick the first one associated with a valid series.

* If there are multiple funds in the paragraph, pick the smallest one (to avoid picking the valuation instead of the amount raised).
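The two paragraph-level rules above can be sketched in isolation as follows. This is a simplified illustration, not the pipeline's code: `to_number`, `series_rank`, `UNITS` and `AMOUNT` are hypothetical names, and the ranking (seed = 0, series a to z = 1 to 26, mezzanine = 27, ipo = 28, public = 29) follows the ordering used in this notebook.

```python
import re

UNITS = {"million": 1e6, "m": 1e6, "billion": 1e9, "b": 1e9}
AMOUNT = re.compile(r"(\d+(?:\.\d+)?)\s?(million|billion|m|b)\b", re.IGNORECASE)

def to_number(fund):
    """Convert a string such as '$1.4 b' to a float, or None if no unit is found."""
    match = AMOUNT.search(fund)
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * UNITS[unit.lower()]

def series_rank(label):
    """Rank a round: seed < series a < ... < series z < mezzanine < ipo < public."""
    label = label.lower()
    special = {"seed": 0, "mezzanine": 27, "ipo": 28, "public": 29}
    if label in special:
        return special[label]
    return ord(label[-1]) - ord("a") + 1   # "series a" -> 1, "series z" -> 26

# Candidates found in one paragraph
funds = ["$100 m", "$1.4 b"]
series = ["Series C", "series d"]

amounts = [n for n in map(to_number, funds) if n is not None]
smallest_amount = min(amounts)                     # rule: smallest amount wins
most_advanced = max(series, key=series_rank)       # rule: most advanced round wins
print(smallest_amount, most_advanced)
```

Picking the minimum is a heuristic: it works when the larger figure is the valuation, as in the sample tweet, but it would pick the wrong number if an article mentions a smaller, unrelated amount in the same paragraph.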

In [436]:
import string

letters = {"series {}".format(letter):number for letter,number in zip(string.ascii_lowercase,range(1,28))}

series_converter={
    "seed":0,
    "mezzanine":27,
    "ipo":28,
    "public":29
}

series_converter.update(letters)

inv_series_converter = {v: k for k, v in series_converter.items()}

Gateway function (there was no time to break it down into smaller functions)

In [473]:
def edit_metadata(parsed_metadata):
    if parsed_metadata=="":
        return ""
    edited_metadata={}

    funds_not_found=True
    series_not_found=True
    just_found_series=False
    company_name=None
    if 1 in parsed_metadata.keys():
        company_name=parsed_metadata[1].get("name_candidates")

    for i in range(1, max(parsed_metadata.keys())+1):
        just_found_series=False
        funds=parsed_metadata[i].get("funds")
        series = parsed_metadata[i].get("series")
        investors =parsed_metadata[i].get("investors")
        if not company_name:
            list_all_investors=[]
            for key in parsed_metadata.keys():
                list_all_investors.extend(parsed_metadata[key].get("investors",[]))
            list_all_investors=[investor.split("’s")[0] for investor in list_all_investors if len(investor)>2]
            # Fall back to the most frequently mentioned entity; skip when none was found
            if list_all_investors:
                company_name=max(set(list_all_investors), key=list_all_investors.count)
                
        if series and series_not_found:
            just_found_series=True
            series_not_found=False
            series=[series_converter[serie.lower()] for serie in series]
            edited_metadata["series"]=inv_series_converter[max(series)]

        if funds and funds_not_found and just_found_series:
            units=[(amount_extractor(fund), unit_extractor(fund), currency_extractor(fund)) for fund in funds]
            units=[(unit[0]*units_converter[unit[1].lower()],unit[2] )for unit in units if unit[0]!="" and unit[1]!=""]
            if units:
                min_amount=min(units)[0]
                min_currency=[unit[1] for unit in units if unit[0]==min_amount][0]
                raised_funds="{} {}".format(str(min_amount), min_currency)
                edited_metadata["raised_funds"]=raised_funds
                funds_not_found=False
            else:
                series_not_found=True

        if investors and just_found_series:
            investors=[investor.split("’s")[0] for investor in investors if investor!=company_name and investor!="IPO" and len(investor)>2]
            edited_metadata["investors"]=set(investors)

    edited_metadata["date"]=parsed_metadata[0]["date"]
    if company_name:
        edited_metadata["company_name"]=company_name
    return edited_metadata

Sanity check

In [412]:
edit_metadata(metadata["parsed"])

{'series': 'series d',
 'raised_funds': '100000000.0 $',
 'investors': ['Baillie Gifford',
  'The Wall Street Journal',
  'Instagrammable',
  'Wellington Management',
  'Baillie Gifford',
  'Global Founders Capital',
  'The Wall Street Journal'],
 'date': '2019-05-14T23:45:41'}

### Applying pipeline

#### raw_data => parsed_data

In [477]:
df_tweets_subset["parsed_data"]=df_tweets_subset["bodies"].apply(parse_metadata)

#### parsed_data => edited_data

In [478]:
df_tweets_subset["edited_data"]=df_tweets_subset["parsed_data"].apply(edit_metadata)

Observation : Project is bigger than anticipated. Sleep-checkpoint required.

In [444]:
df_tweets_subset.to_csv("./data/checkpoint.csv", sep=";", index=False)

Cleaning up

In [480]:
# Rows without an article hold "" instead of a dict; turn them into empty dicts
edited_data=pd.DataFrame(df_tweets_subset["edited_data"].apply(lambda x : {} if isinstance(x, str) else x).to_list())

In [484]:
staging_data=pd.concat([df_tweets_subset, edited_data], axis=1)

In [485]:
staging_data

Unnamed: 0,date,user,tweet,urls,raw_tweets,funds,bodies,raw_data,parsed_data,edited_data,company_name,date.1,investors,raised_funds,series
0,2019-05-14 23:47:10,techcrunch,"trendy luggage brand away packs on $100 m , ro...",https://tcrn.ch/2LJR6rJ,"Trendy luggage brand Away packs on $100M, roll...","['$100M', '$1.4B']","{'body': [<p id=""speakable-summary""><a href=""h...","{1: {'name_candidates': Away}, 2: {'investors'...","{1: {}, 2: {'investors': ['Baillie Gifford', '...","{'series': 'series d', 'raised_funds': '100000...",Away,2019-05-14T23:45:41,"{Global Founders Capital, Baillie Gifford, Ins...",100000000.0 $,series d
1,2019-05-14 22:49:52,techcrunch,"crowdstrike , a cybersecurity unicorn , files ...",https://tcrn.ch/2W54ivc,"CrowdStrike, a cybersecurity unicorn, files to...",[],"{'body': [<p id=""speakable-summary"">If you tho...","{1: {'name_candidates': disastrous, 'investors...","{1: {'investors': ['Uber'], 'series': ['public...","{'series': 'public', 'investors': {'Uber'}, 'd...",CrowdStrike,2019-05-14T22:48:29,{Uber},,public
2,2019-05-14 22:01:37,techcrunch,san francisco passes city government ban on fa...,https://tcrn.ch/2WIrbBv,San Francisco passes city government ban on fa...,[],"{'body': [<p id=""speakable-summary"">On Tuesday...",{1: {'name_candidates': Stop Secret Surveillan...,{1: {'investors': ['San Francisco’s Board of S...,"{'series': 'public', 'date': '2019-05-14T22:01...",Peskin,2019-05-14T22:01:10,,,public
3,2019-05-14 22:01:00,techcrunch,preparing for a future of drone -filled skies ...,https://tcrn.ch/2vV3dHy,Preparing for a future of drone-filled skies h...,[],"{'body': [<p id=""speakable-summary"">The last f...","{1: {'name_candidates': shot down, 'investors'...","{1: {'investors': ['Gatwick'], 'series': ['ser...","{'series': 'series o', 'investors': {'Gatwick'...",FAA,2019-05-14T22:00:24,{Gatwick},,series o
4,2019-05-14 21:26:26,techcrunch,stocks gain back some ground as investors asse...,https://tcrn.ch/2Yqf10E,Stocks gain back some ground as investors asse...,[],"{'body': [<p id=""speakable-summary"">Stocks had...","{1: {}, 2: {'investors': ['S&P']}, 3: {}, 4: {...","{1: {}, 2: {'investors': ['S&P']}, 3: {}, 4: {...","{'date': '2019-05-14T21:24:57', 'company_name'...",The New York Times,2019-05-14T21:24:57,,,
5,2019-05-14 20:35:04,techcrunch,this jam -packed agenda is overflowing with so...,,,[],,,,,,,,,
6,2019-05-14 20:32:01,techcrunch,apply now for startup battlefield at disrupt s...,https://tcrn.ch/2WPAX5b,Apply now for Startup Battlefield at Disrupt S...,[],"{'body': [<p id=""speakable-summary"">We’re look...","{1: {'name_candidates': Startup Battlefield, '...","{1: {'investors': ['Disrupt']}, 2: {'investors...","{'date': '2019-05-14T20:30:17', 'company_name'...",Startup Battlefield,2019-05-14T20:30:17,,,
7,2019-05-14 20:17:39,techcrunch,innowatts raises $18 million for its energy mo...,https://tcrn.ch/2W6eYcT,Innowatts raises $18 million for its energy mo...,['$18 million'],"{'body': [<p dir=""ltr""><a class=""crunchbase-li...","{1: {'name_candidates': Innowatts, , 'investor...","{1: {'investors': ['Innowatts', 'Energy Impact...","{'date': '2019-05-14T20:16:13', 'company_name'...",Innowatts,2019-05-14T20:16:13,,,
8,2019-05-14 20:03:37,techcrunch,new relic takes a measured approach to platfor...,https://tcrn.ch/2w0kTkZ,New Relic takes a measured approach to platfor...,[],"{'body': [<p id=""speakable-summary""><a href=""h...","{1: {'name_candidates': New Relic, 'investors'...","{1: {'investors': ['SaaS']}, 2: {'investors': ...","{'date': '2019-05-14T20:02:12', 'company_name'...",New Relic One,2019-05-14T20:02:12,,,
9,2019-05-14 20:00:07,techcrunch,"beyond costs , what else can we do to make hou...",,"Beyond costs, what else can we do to make hous...",[],,,,,,,,,


In [493]:
edited_data.dropna()

Unnamed: 0,company_name,date,investors,raised_funds,series
0,Away,2019-05-14T23:45:41,"{Global Founders Capital, Baillie Gifford, Ins...",100000000.0 $,series d
15,Craftory,2019-05-14T18:24:29,{TomboyX},18000000.0 $,series b
31,Impossible Foods,2019-05-14T16:22:48,{Beyond Meat},750000000.0 $,public
92,SAP,2019-05-10T17:23:47,{the New York Stock Exchange},8100000000.0 $,public
96,Uber,2019-05-10T15:52:55,{the New York Stock Exchange},8100000000.0 $,public
110,Grain,2019-05-10T04:33:45,"{Genesis Alternative Ventures, Ozi Amanat, Sas...",10000000.0 $,series b
114,Mobike,2019-05-08T23:03:38,{Meituan-Dianping},20000000.0 $,ipo
151,HeyJobs,2019-05-08T08:00:53,{Series A},12000000.0 $,series a
153,CollegeDekho,2019-05-08T03:46:25,{},8000000.0 $,series b
174,Carta,2019-05-06T15:00:30,"{TechCrunch, Marc Andreessen, Andreessen Horow...",300000000.0 $,series e


#### Conclusion

We managed to build a data pipeline that starts with scraped Twitter data and ends with raised funds, series, date, company name and investors (when applicable).

The process can certainly be improved, but that would require unprocessed tweets from different users, so that the pipeline can be generalized beyond a single account.