## Model Application Phase: AI Landscape data source

### Author: Chaitali Suhas Bagwe (cbagwe@mail.uni-paderborn.de)

***
#### Import necessary libraries

In [2]:
import pandas as pd
import pickle
from selenium import webdriver
from data_processing import get_cleaned_webdata
from data_extraction import extraction_on_cleaned_webdata
from similarity_matching import find_clusters

***
#### Initialize a dataframe with two empty columns for storing Company Name and its Link

In [3]:
landscape_df = pd.DataFrame(columns=['Company', 'Link'])

company_name = []
company_link = []

***
#### Extract data from the AI landscape website using Selenium

In [16]:
# Initialize the Chrome driver and set an implicit wait of 10 seconds
browser = webdriver.Chrome()
browser.implicitly_wait(10)

# Add the AI landscape website - Choice between:
# 1) Europe      : https://www.ai-startups-europe.eu/
# 2) Germany     : https://www.ai-startups.de/
# 3) Sweden      : https://www.ai-startups.se/
# 4) Netherlands : https://www.ai-startups-europe.eu/nl
# 5) Norway      : https://www.ai-startups-europe.eu/no
# 6) France      : https://www.ai-startups.fr/
browser.get("https://www.ai-startups.de/")

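# Handle the cookie consent banner
# (the find_element_by_* helpers were deprecated and later removed in Selenium 4;
# By locators work in both Selenium 3 and 4)
from selenium.webdriver.common.by import By
browser.find_element(By.XPATH, '//*[@id="uc-btn-accept-banner"]').click()

# Extract the companies via their HTML class name
elements = browser.find_elements(By.CLASS_NAME, "startupWebsite")
browser.execute_script("window.scrollBy(0, 2350)")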


for element in elements:
    company_name.append(element.text)
    company_link.append(element.get_attribute("href"))

# Closing the driver
browser.quit()

#### Store the extracted company data in the dataframe initialized earlier

In [5]:
landscape_df["Company"] = company_name
landscape_df["Link"] = company_link
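
The column assignment above relies on both lists having the same length. As a minimal sketch, the table could also be built in one step with a length check and de-duplication of repeated links. The helper name `build_company_frame` and the sample rows are hypothetical and not part of the original notebook.

```python
import pandas as pd


def build_company_frame(names, links):
    """Build the Company/Link table in one step (hypothetical helper)."""
    if len(names) != len(links):
        raise ValueError("names and links must have the same length")
    frame = pd.DataFrame({"Company": names, "Link": links})
    # Drop entries without a link, then entries whose link was already seen.
    frame = frame.dropna(subset=["Link"])
    frame = frame.drop_duplicates(subset="Link").reset_index(drop=True)
    return frame


# Illustrative input only; the real lists come from the Selenium cell.
demo_df = build_company_frame(
    ["Askby", "Back", "Back", "NoLink"],
    ["http://askby.ai/", "https://backhq.com/", "https://backhq.com/", None],
)
print(demo_df)
```

Whether duplicates or missing links actually occur on the landscape pages is not stated in the notebook, so the filtering steps are a precaution rather than a required fix.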

#### Get the webdata, clean it and store it in the dataframe

In [6]:
landscape_df["WebData"] = get_cleaned_webdata(landscape_df)
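
The implementation of `get_cleaned_webdata` lives in the project module `data_processing` and is not shown here. Judging only by the lowercase, punctuation-free text in the `WebData` column, the cleaning step might resemble the sketch below. The function `clean_text` and its regular expressions are assumptions for illustration, not the author's code.

```python
import re

# Assumed patterns: anything that looks like an HTML tag, and any character
# that is neither a lowercase letter nor whitespace.
TAG_PATTERN = re.compile(r"<[^>]+>")
NON_LETTER_PATTERN = re.compile(r"[^a-z\s]")


def clean_text(raw_html):
    """Hypothetical cleaner: strip tags, lowercase, keep letters only."""
    text = TAG_PATTERN.sub(" ", raw_html)
    text = NON_LETTER_PATTERN.sub(" ", text.lower())
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


cleaned = clean_text("<h1>AIME</h1> Deep-Learning Workstations & GPU servers, 24/7!")
print(cleaned)
```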

#### Extract keywords from the cleaned webdata and store them in the dataframe

In [7]:
extraction_on_cleaned_webdata(landscape_df, "keywords")
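
How `extraction_on_cleaned_webdata` selects keywords is not shown in this notebook; the later cells only indicate that the result ends up in a `Keywords` column. As a stand-in to illustrate the idea, the sketch below picks the most frequent non-stopword tokens. This frequency-based approach, the function `top_keywords`, and the tiny stopword set are all assumptions and may differ from the project's method.

```python
from collections import Counter

# Deliberately small, illustrative stopword set (assumption).
STOPWORDS = {"the", "and", "for", "with", "a", "of", "to", "in"}


def top_keywords(cleaned_text, k=5):
    """Return the k most frequent non-stopword tokens, joined by spaces."""
    tokens = [t for t in cleaned_text.split() if t not in STOPWORDS]
    return " ".join(word for word, _ in Counter(tokens).most_common(k))


demo_keywords = top_keywords(
    "lidar sensor for the lidar software and lidar sensor", k=2
)
print(demo_keywords)
```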

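#### Load the stored model and vectorizer from local disk

In [8]:
# Use context managers so the file handles are closed after loading
with open('finalized_model.sav', 'rb') as model_file:
    pickled_model = pickle.load(model_file)
with open('vectorizer.sav', 'rb') as vectorizer_file:
    pickled_vectorizer = pickle.load(vectorizer_file)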
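
For reference, a save and load round trip for such artifacts can be sketched as follows. The helper names, the file name `demo_model.sav`, and the dictionary standing in for a trained model are hypothetical. Note that `pickle.load` can execute arbitrary code, so only files from a trusted source should be loaded.

```python
import os
import pickle
import tempfile


def save_artifact(obj, path):
    """Serialize obj to path; the file is closed when the block exits."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_artifact(path):
    """Deserialize and return the object stored at path."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# A plain dict stands in for the trained model in this sketch.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.sav")
    save_artifact({"weights": [0.1, 0.2]}, demo_path)
    restored = load_artifact(demo_path)

print(restored)
```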

#### Transform the keywords with the stored vectorizer, then predict with the trained ML model

In [9]:
test_input = pickled_vectorizer.transform(landscape_df["Keywords"]).toarray().tolist()
prediction = pickled_model.predict(test_input)
print(prediction)

[1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 1. 0. 1. 1. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1.
 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0.
 0. 0. 0. 1.]
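
The key point of this step is that the stored vectorizer is only applied (`transform`), never refitted, so new text is mapped onto the vocabulary learned during training. The sketch below illustrates this on toy data with scikit-learn. The vectorizer and model types used in the project are not stated in this notebook, so `TfidfVectorizer`, `LogisticRegression`, and the training sentences are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (assumption): 1 = product creation, 0 = other.
train_keywords = [
    "product design requirements testing engineering",
    "requirements management product development simulation",
    "shirts fashion cashmere shop cart",
    "employee benefits people hr experience",
]
train_labels = [1, 1, 0, 0]

# Training phase: fit the vocabulary and the classifier together.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_keywords)
model = LogisticRegression().fit(X_train, train_labels)

# Application phase: reuse the fitted vocabulary with transform().
new_keywords = ["product testing requirements", "shop shirts cart"]
X_new = vectorizer.transform(new_keywords)
demo_prediction = model.predict(X_new)
print(demo_prediction)
```

Many scikit-learn estimators accept the sparse matrix returned by `transform` directly, so the `.toarray().tolist()` conversion may be unnecessary, depending on the stored model.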


#### Store the predicted values in the dataframe

In [10]:
landscape_df["Prediction"] = prediction

#### Print the dataframe of AI companies

In [11]:
landscape_df

Unnamed: 0,Company,Link,WebData,Keywords,Prediction
0,Askby,http://askby.ai/,,,1.0
1,Back,https://backhq.com/,back one stop shop employee needs joining forc...,back employee employees people experience work...,0.0
2,Charles,https://hey-charles.com/,charles skip content menu cart men shirts oxfo...,shirts cashmere turtleneck superpants men wome...,0.0
3,Curiosity,http://curiosity.ai/,curiosity copy code download sign new pricing ...,minutes report files find marketing meeting pd...,1.0
4,Deepset,http://deepset.ai/,enterprise ml nlp products solutions semantic ...,nlp cloud deepset question model answering sem...,0.0
...,...,...,...,...,...
95,Explosion,http://explosion.ai/,explosion makers spacy prodigy ai nlp develope...,explosion learning developer language makers s...,0.0
96,AIME,https://aime.info/,aime deep learning workstations servers gpu cl...,nvidia rtx gpu aime learning deep ssd server c...,0.0
97,Blickfeld,http://blickfeld.com/,blickfeld lidar smart efficient digital world ...,lidar blickfeld software sensors find cube sol...,0.0
98,Evocortex,https://evocortex.org/,evocortex industrial automation mobile robotic...,evocortex localization sensor robotics softwar...,0.0
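
Rows such as Askby show an empty `WebData` and `Keywords` entry yet still carry a prediction. The model had no text to work with in those cases, so such predictions may deserve less trust. A minimal sketch for flagging these rows, using a small hand-made frame rather than the real data:

```python
import pandas as pd

# Hand-made sample mirroring the column names of the real dataframe.
sample_df = pd.DataFrame({
    "Company": ["Askby", "Back", "Curiosity"],
    "WebData": ["", "back one stop shop employee needs", "curiosity copy code download"],
    "Prediction": [1.0, 0.0, 1.0],
})

# Treat missing values and whitespace-only strings as "no text scraped".
has_text = sample_df["WebData"].fillna("").str.strip().ne("")
unreliable = sample_df.loc[~has_text, "Company"].tolist()
print(unreliable)
```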


#### Keep only the companies that the ML model predicts to be product creation companies and store them in a new dataframe

In [12]:
landscape_prodEngg_df = landscape_df[landscape_df.Prediction == 1.0]
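
The selection is a boolean mask on the `Prediction` column. Because new columns are added to the result later, it can help to take an explicit copy so the filtered frame is independent of the original. A minimal sketch on toy data; the frame contents and the cluster labels `"A"` and `"B"` are placeholders:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Company": ["Askby", "Back", "Curiosity"],
    "Prediction": [1.0, 0.0, 1.0],
})

# Boolean mask keeps rows predicted as 1.0; copy() detaches the result.
selected_df = toy_df[toy_df["Prediction"] == 1.0].copy()

# Adding a column now affects only the copy, not toy_df.
selected_df["Main_Cluster"] = ["A", "B"]
print(selected_df)
```

The filtered frame keeps the original row labels (here 0 and 2), which matches the non-contiguous index in the printed product creation table.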

#### Find the main and sub clusters of these companies

In [13]:
main_cluster_labels, sub_cluster_labels = find_clusters(landscape_prodEngg_df)

#### Store the main and sub clusters in the dataframe

In [14]:
# landscape_prodEngg_df was created by filtering landscape_df; take an explicit
# copy so the new columns are added without a SettingWithCopyWarning
landscape_prodEngg_df = landscape_prodEngg_df.copy()
landscape_prodEngg_df["Main_Cluster"] = main_cluster_labels
landscape_prodEngg_df["Sub_Cluster"] = sub_cluster_labels


#### Print the dataframe of product creation companies

In [15]:
landscape_prodEngg_df

Unnamed: 0,Company,Link,WebData,Keywords,Prediction,Main_Cluster,Sub_Cluster
0,Askby,http://askby.ai/,,,1.0,Strategy Product Planning,Customer Market Analysis
3,Curiosity,http://curiosity.ai/,curiosity copy code download sign new pricing ...,minutes report files find marketing meeting pd...,1.0,Strategy Product Planning,Customer Market Analysis
9,Eagle AI,https://eagle-ai.de/,eagle ai ai data strategies sky munich germany...,data eagle gallery projects tools information ...,1.0,Strategy Product Planning,Customer Market Analysis
18,Attention Insight,https://attentioninsight.com/,attention insight heatmaps ai driven pre launc...,attention insight images concepts pages produc...,1.0,Strategy Product Planning,Customer Market Analysis
28,Alpas,https://alpas.ai/,alpas sourcing means power product company go ...,revenue annual mill founded alpas supplier sou...,1.0,Strategy Product Planning,Customer Market Analysis
29,Ava,https://ava.info/,,,1.0,Strategy Product Planning,Customer Market Analysis
31,Celus,http://www.celus.io/,server error website provider havingtrouble lo...,celus page request ead address www server erro...,1.0,Product Development,Requirements Management
32,DeepAtom,https://deepatom.ai/,error connectyourdomain occurred regardless re...,error browser connectyourdomain occurred recom...,1.0,Product Development,Testing
33,Ada,http://ada.com/,,,1.0,Strategy Product Planning,Customer Market Analysis
35,Aignostics,http://www.aignostics.com/,aignostics skip content technology partner men...,settings data aignostics pathology clinical mo...,1.0,Product Development,Testing
