# Assigning a company to an industry

We previously used a keyword search over the job descriptions to guess each company's industry. However, with our keyword selection only 200 companies were tagged properly, while we have 825 jobs to tag in total.
Here we look for a way to increase this number.

The website used to get the clusters is: https://berlin.startups-list.com/

**Imports and page download**

In [27]:
import warnings

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Silence library warnings in the notebook output
warnings.filterwarnings('ignore')

# Download the landing page and parse it
page = 'https://berlin.startups-list.com/'
resp = requests.get(page)
soup = BeautifulSoup(resp.text, 'lxml')


**Extract the `nav` element that holds the links to the category pages**

In [2]:
ind = soup.find_all("nav", class_="categories")

In [3]:
ind

[<nav class="categories">
 <a class="btn btn-outline " href="/" style="margin-right:5px; ">
         All Startups</a>
 <a class="btn btn-outline " data-id="id3" href="/startups/mobile" style="margin-right:5px; ">Mobile </a>
 <a class="btn btn-outline " data-id="id29" href="/startups/e-commerce" style="margin-right:5px; ">E-Commerce </a>
 <a class="btn btn-outline " data-id="id10" href="/startups/SaaS" style="margin-right:5px; ">SaaS </a>
 <a class="btn btn-outline " data-id="id162" href="/startups/marketplaces" style="margin-right:5px; ">Marketplaces </a>
 <a class="btn btn-outline " data-id="id4" href="/startups/digital_media" style="margin-right:5px; ">Digital Media </a>
 <a class="btn btn-outline " data-id="id6" href="/startups/social_media" style="margin-right:5px; ">Social Media </a>
 <a class="btn btn-outline " data-id="id43" href="/startups/education" style="margin-right:5px; ">Education </a>
 <a class="btn btn-outline " data-id="id12" href="/startups/enterprise_software" style=

**Create a DataFrame with the industries and their respective page URLs**

In [4]:
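url = []
indu = []
for x in ind:
    for a in x.find_all("a"):
        # hrefs are site-relative, so prepend the domain
        u = 'https://berlin.startups-list.com' + a['href']
        url.append(u)
        # The link text carries surrounding whitespace (e.g. "Mobile "), so strip it
        indu.append(a.get_text().strip())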
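
The concatenation above works because every `href` in this `nav` starts with `/`. As a minimal sketch, the same step could be written with the standard library's `urllib.parse.urljoin`, which does not depend on whether the base URL ends with a slash. The `hrefs` list is a hand-written sample (the first three links from the output above plus a `#` placeholder), not scraped data:

```python
from urllib.parse import urljoin

BASE = 'https://berlin.startups-list.com/'

# Hand-written sample of href values (assumption: mirrors the nav output above)
hrefs = ['/', '/startups/mobile', '/startups/e-commerce', '#']

# Skip pure-fragment placeholders, then resolve each href against the base URL
urls = [urljoin(BASE, h) for h in hrefs if not h.startswith('#')]

for u in urls:
    print(u)
```

Filtering the `#` links at this point would make the later removal of `'https://berlin.startups-list.com#'` unnecessary.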

In [5]:
df = pd.DataFrame({'Industry': indu, 'url': url})
# Drop the first row, the "All Startups" link back to the landing page
df = df.iloc[1:, :]
df.head()

Unnamed: 0,Industry,url
1,Mobile,https://berlin.startups-list.com/startups/mobile
2,E-Commerce,https://berlin.startups-list.com/startups/e-co...
3,SaaS,https://berlin.startups-list.com/startups/SaaS
4,Marketplaces,https://berlin.startups-list.com/startups/mark...
5,Digital Media,https://berlin.startups-list.com/startups/digi...


**Visit each URL and extract its list of companies**

In [6]:
# Drop navigation entries whose href is just '#' (no category page behind them)
df = df[df['url'] != 'https://berlin.startups-list.com#']

In [7]:
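dicti = []  # list of (company, industry) tuples
for x in list(df['url']):
    resp = requests.get(x)
    soup2 = BeautifulSoup(resp.text, 'lxml')
    # Company names sit in <h1 property="name">; the category is in the page <title>
    h1s = soup2.find_all("h1", property="name")
    cate = soup2.find_all('title')
    for u in h1s:
        name = u.text.strip()
        for y in cate:
            # Skip the 44-character '<title>' prefix (tag plus leading whitespace),
            # then keep everything up to the first blank line
            o = str(y).strip()[44:].split('\n\n', 1)[0]
            dicti.append((name, o))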
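
The fixed `[44:]` slice depends on the exact whitespace inside the page's `<title>` tag, so any change to the page template would silently shift it. A minimal sketch of a less position-dependent alternative is shown here: it takes the first non-empty line of the title text and collapses its whitespace. The helper name `title_to_industry` and the `sample_title` string are hypothetical, written to resemble the title layout implied by the slice above:

```python
def title_to_industry(title_text: str) -> str:
    """Return the first non-empty line of a title text, with whitespace collapsed."""
    for line in title_text.splitlines():
        collapsed = ' '.join(line.split())
        if collapsed:
            return collapsed
    return ''


# Hypothetical title text, assumed to resemble the site's <title> content
sample_title = (
    '\n        \n           \n\n'
    '              Mobile Companies in Berlin\n\n'
    '  Berlin Startups List\n'
)

industry = title_to_industry(sample_title)
print(industry)
```

In the loop, such a helper would presumably be applied to `soup2.title.get_text()`. Pages with a malformed title would still need to be filtered out afterwards.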
        
    

In [8]:
aa = pd.DataFrame(dicti, columns =('Companies','Industries'))

In [9]:
aa.drop_duplicates(inplace=True)

In [10]:
len(aa)

3372

In [11]:
aa = aa[aa['Industries']!= 'in\n            •  •\n             Berlin Startups List\n        ']

**Print the DataFrame with all the companies listed on the website and their respective industries.**

In [12]:
aa.head()

Unnamed: 0,Companies,Industries
0,Navegas Media UG,Mobile Companies in Berlin
1,Re2you,Mobile Companies in Berlin
2,HIGH MOBILITY,Mobile Companies in Berlin
3,shoutr labs UG,Mobile Companies in Berlin
4,Vamos - The Event Guide,Mobile Companies in Berlin


**Some companies are listed under several industries, so we use the industry counts to prioritise**

In [13]:
counting = pd.DataFrame(aa['Industries'].value_counts()).reset_index()

In [14]:
counting.columns =('Industries','Counts')

In [15]:
counting.head()

Unnamed: 0,Industries,Counts
0,Mobile Companies in Berlin,132
1,E-commerce Companies in Berlin,117
2,SaaS Companies in Berlin,71
3,Marketplaces Companies in Berlin,55
4,Digital media Companies in Berlin,52
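
The default column names produced by `reset_index()` after `value_counts()` differ between pandas versions, which is why the columns are renamed explicitly above. A minimal sketch of an equivalent one-step form that names both columns directly, shown on a small made-up series rather than the scraped data:

```python
import pandas as pd

# Toy data (made up for illustration), standing in for aa['Industries']
industries = pd.Series(
    ['Mobile', 'Mobile', 'Mobile', 'SaaS', 'SaaS', 'Fashion'],
    name='Industries',
)

# Name the index and the value column explicitly instead of relying on defaults
counting = (
    industries.value_counts()
    .rename_axis('Industries')
    .reset_index(name='Counts')
)

print(counting)
```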


**We merge the two tables `aa` and `counting`, then sort the result so that, for each company, the industry with the highest count comes first.**

In [16]:
df = pd.merge(aa,counting, how = 'left', on = 'Industries')

In [17]:
df = df.sort_values(['Companies','Counts'],ascending= False)

In [18]:
df_final = df.drop_duplicates(subset='Companies', keep="first")
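
To make the selection rule concrete, here is a small sketch on made-up data. It attaches the industry count with `groupby(...).transform('size')` instead of a separate merge (a swapped-in technique, not the notebook's own), then keeps the most frequent industry per company with the same sort and `drop_duplicates` idea:

```python
import pandas as pd

# Toy data (made up for illustration), using the same column names as `aa`
aa_toy = pd.DataFrame({
    'Companies': ['A', 'A', 'B', 'C', 'C', 'D'],
    'Industries': ['Mobile', 'Fashion', 'Mobile', 'Mobile', 'SaaS', 'SaaS'],
})

# Number of companies listed under each industry, repeated on every row
aa_toy['Counts'] = aa_toy.groupby('Industries')['Companies'].transform('size')

# For each company, keep the row whose industry has the highest count
top = (
    aa_toy.sort_values(['Companies', 'Counts'], ascending=[True, False])
    .drop_duplicates(subset='Companies', keep='first')
    .reset_index(drop=True)
)

print(top)
```

Company `A` is listed under both "Mobile" (3 listings) and "Fashion" (1 listing), so it ends up tagged as "Mobile". Ties between equally frequent industries are not handled in this sketch.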

**Create a table with only the variables we need**

In [19]:
# Copy the selection so that later column assignments do not act on a view of df_final
industry_rawdata = df_final[['Industries','Companies']].copy()
industry_rawdata.columns = ('group', 'company_name')

**Import the original job table and extract the company and ID columns**

In [20]:
raw_data = pd.read_csv('data-cleaning/Clean_JobTitles/dfclean.csv', index_col=0).reset_index()

In [21]:
raw_data.head()

Unnamed: 0,ID,company_name,date,description,jobtitle,CleanTitles,source
0,1,Fatmap,2017-10-02,Role & Responsibility: \n\nYou’ll be building ...,mobile engineer,Mobile App Developer,Berlin Startup Jobs
1,2,AI Engine,2017-10-02,AI Engine is developing innovative machine lea...,machine learning,Machine Learning Engineer,Berlin Startup Jobs
2,3,November,2017-10-02,Your mission:\n\nDevelopment of a scalable sof...,senior full stack php developer (f m),Full Stack Developer,Berlin Startup Jobs
3,5,CrossEngage,2017-10-02,About CrossEngage\nCrossEngage is a cloud-base...,data scientist,Data Scientist,Berlin Startup Jobs
4,6,Ruum,2017-10-02,If you’re passionate about learning new techno...,full stack developer,Full Stack Developer,Berlin Startup Jobs


In [22]:
#industry_model = industry_rawdata.set_index('company_name').to_dict()['group']


In [28]:
industry_rawdata['company_name'] = industry_rawdata['company_name'].str.lower()

In [29]:
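# Copy the selection so that the lowercase assignment below does not act on a view of raw_data
industry_table = raw_data[['ID','company_name']].copy()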

In [30]:
industry_table['company_name'] = industry_table['company_name'].str.lower()
industry_table.head()

Unnamed: 0,ID,company_name
0,1,fatmap
1,2,ai engine
2,3,november
3,5,crossengage
4,6,ruum
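
Lowercasing is the only normalisation applied before matching, while names on the website can carry legal suffixes such as "UG" (for example "Navegas Media UG"). As a hedged sketch, a slightly more aggressive normalisation could look like the following. The helper `normalize_name` and its suffix list are assumptions for illustration, and whether this would raise the match rate on this data is untested:

```python
import re

# Assumed list of legal-form suffixes; extend or trim as needed
_LEGAL_SUFFIX = re.compile(r'\s+(?:ug|gmbh|ag|se|inc|ltd)\.?$')


def normalize_name(name: str) -> str:
    """Lowercase, collapse whitespace, and drop one trailing legal suffix."""
    cleaned = ' '.join(name.lower().split())
    return _LEGAL_SUFFIX.sub('', cleaned)


examples = ['Navegas Media UG', 'shoutr labs UG ', 'HIGH MOBILITY', 'Ruum']
normalized = [normalize_name(n) for n in examples]
print(normalized)
```

Such a helper would be applied to `company_name` in both tables before the merge, for instance with `.map(normalize_name)`.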


In [31]:
industry_rawdata.head()

Unnamed: 0,group,company_name
1942,Clean technology Companies in Berlin,überlin
298,SaaS Companies in Berlin,zefly.com
208,E-commerce Companies in Berlin,yourpainting
592,Fashion Companies in Berlin,youbl
110,Mobile Companies in Berlin,you & the gang


In [32]:
cc = pd.merge(industry_table,industry_rawdata, how='left', on='company_name').dropna()

In [33]:
len(cc)

101

In [34]:
len(industry_table['company_name'])

825

In [35]:
industry_table['company_name'].nunique()

414
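
The figures above mix two units: `len(cc)` counts matched job rows, while `nunique()` counts distinct companies. A minimal sketch of how both rates could be read from a single merge using `indicator=True`, shown on made-up tables with the same column names:

```python
import pandas as pd

# Toy tables (made up for illustration), same columns as industry_table / industry_rawdata
jobs = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'company_name': ['priori data', 'priori data', 'fatmap', 'ruum'],
})
industries = pd.DataFrame({
    'company_name': ['priori data', 'gameduell'],
    'group': ['Mobile', 'Mobile games'],
})

# indicator=True adds a '_merge' column saying where each row found a match
merged = jobs.merge(industries, how='left', on='company_name', indicator=True)
matched = merged[merged['_merge'] == 'both']

matched_rows = len(matched)                             # matched job postings
matched_companies = matched['company_name'].nunique()   # matched distinct companies
total_companies = jobs['company_name'].nunique()

print(f'{matched_rows}/{len(jobs)} job rows matched')
print(f'{matched_companies}/{total_companies} companies matched')
```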

In [36]:
cc.head()

Unnamed: 0,ID,company_name,group
13,15,gameduell,Mobile games Companies in Berlin
15,17,iplytics,Business services Companies in Berlin
16,18,chartmogul,E-commerce Companies in Berlin
17,19,priori data,Mobile Companies in Berlin
18,20,priori data,Mobile Companies in Berlin


**Only 101 of the 825 job postings (414 distinct companies) were matched to an industry. We are not going to use this website to build the industry table.**