# Analyzing Biotechnology Related Companies in the Proximity of Aachen, Germany

## 1) The description of the problem and a discussion of the background

A contractor wants to start a business in the field of biotechnology. The founders are young entrepreneurs with deep knowledge of their field, but they do not know exactly where to set up their laboratory. They want to know where biotechnology-related companies are located and, if possible, why they settled in those areas. My aim is to give them sufficient information about the neighborhoods of Aachen and to simplify their decision process. I am going to analyse the geospatial data in the proximity of Aachen, show the distribution of the company locations, and explore the data.

**Let's import the necessary libraries**

In [None]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

!pip install geopy
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
    
# transforming a json file into a pandas dataframe
from pandas import json_normalize # pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize

# !conda install -c conda-forge folium=0.5.0 --yes
!pip install folium
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

### Define Foursquare Credentials and Version

In [None]:
import os

# read the credentials from environment variables instead of hard-coding them in the notebook
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', 'YOUR_CLIENT_ID') # your Foursquare ID
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', 'YOUR_CLIENT_SECRET') # your Foursquare Secret
VERSION = '20200315'
LIMIT = 500
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

### Let's assume we live in Aachen, Germany, and want to start a business nearby. First, let's convert Aachen's address to its latitude and longitude coordinates

In [None]:
# In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent foursquare_agent, as shown below.

address = 'Aachen, Germany'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('latitude and longitude of Aachen: {}, {}'.format(latitude, longitude))

## 2) A description of the data and how it will be used to solve the problem

- The contractor wants to start a business in the field of biotechnology in the proximity of Aachen. For this purpose, I am going to search for venues related to
    - genetics
    - biology
    - biotechnology
    - bioscience
    - DNA

- I am going to retrieve all venues within 100 km of Aachen.

- I am going to obtain 5 dataframes and then combine them so that they can be analysed together.

### Let's start by searching for a specific venue category

Let's define a query to search for "genetics"-related venues that are within 100 km of Aachen.

### 2.1. Genetics related venues

In [None]:
search_query = 'genetic'
radius = 100000
print(search_query + ' .... OK!')

Define the corresponding URL

In [None]:
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
url
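The same URL pattern is repeated for every query in this notebook. Below is a minimal sketch of a helper that builds it with the standard library and also encodes queries that contain spaces (such as 'genetic diagnos' later on). The function name `build_search_url`, the placeholder credentials and the rounded Aachen coordinates are assumptions made for illustration; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

# Hypothetical helper (not part of the original notebook): builds the
# venue-search URL from the same parameters the notebook passes to .format()
def build_search_url(client_id, client_secret, lat, lng, version, query, radius, limit):
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
        'query': query,
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in the "ll" value readable; spaces become "+"
    return 'https://api.foursquare.com/v2/venues/search?' + urlencode(params, safe=',')

# placeholder credentials and approximate coordinates, for illustration only
example_url = build_search_url('MY_ID', 'MY_SECRET', 50.7753, 6.0839,
                               '20200315', 'genetic diagnos', 100000, 500)
print(example_url)
```

With such a helper, each later query would only need to change the `query` and `radius` arguments.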

### Send the GET Request and examine the results

In [None]:
results = requests.get(url).json()
# results
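Before reading `results['response']['venues']`, it can help to check that the request succeeded. The sketch below assumes that the decoded reply carries a `meta` block with a numeric `code` (and an `errorDetail` on failure); the helper name `extract_venues` and both sample payloads are made up for illustration.

```python
# Hypothetical helper: pull the venue list out of a decoded reply and fail
# with a readable message when the API reports an error.
def extract_venues(payload):
    meta = payload.get('meta', {})
    code = meta.get('code')
    if code != 200:
        raise RuntimeError('Venue search failed (code {}): {}'.format(
            code, meta.get('errorDetail', 'no detail given')))
    # an empty result is valid, so a missing key is treated as "no venues"
    return payload.get('response', {}).get('venues', [])

# made-up payloads that only mimic the assumed shape of a reply
ok_payload = {'meta': {'code': 200},
              'response': {'venues': [{'id': 'abc', 'name': 'Example Lab'}]}}
bad_payload = {'meta': {'code': 400, 'errorDetail': 'Missing access credentials.'},
               'response': {}}

print(extract_venues(ok_payload))
try:
    extract_venues(bad_payload)
except RuntimeError as err:
    print(err)
```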

### Get relevant part of JSON and transform it into a pandas dataframe

In [None]:
# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
df_genetics = json_normalize(venues)
df_genetics.head()
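To see what the normalization step does, here is a small sketch on made-up data: nested `location` fields are flattened into dotted column names such as `location.lat`, and fields that are missing for a venue become NaN. The two sample venues are invented and only mimic the nesting of the real `venues` list.

```python
import pandas as pd

# made-up venues that mimic the nesting of the "venues" list
sample_venues = [
    {'id': 'a1', 'name': 'Example Genetics Lab',
     'location': {'lat': 50.78, 'lng': 6.08, 'city': 'Aachen'}},
    {'id': 'b2', 'name': 'Example Institute',
     'location': {'lat': 50.94, 'lng': 6.96}},  # no city given
]

sample_df = pd.json_normalize(sample_venues)
print(sample_df.columns.tolist())
print(sample_df)
```

This is why the later cells refer to columns like `location.lat` and `location.lng`, and why some location columns contain missing values.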

### 2.2. Biology related venues

We repeat the same procedure for the biology-related venues

In [None]:
search_query = 'biology'
print(search_query + ' .... OK!')
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)

results = requests.get(url).json()

# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
df_biology = json_normalize(venues)
df_biology.head()

### 2.3. Biotechnology related venues

In [None]:
search_query = 'biotechnology'
print(search_query + ' .... OK!')

url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)

results = requests.get(url).json()

# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
df_biotechnology = json_normalize(venues)
df_biotechnology.head()

**We see that the search returns no biotechnology-related venue in the proximity of Aachen**

### 2.4. Bioscience related venues

In [None]:
search_query = 'bioscience'
print(search_query + ' .... OK!')

url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)

results = requests.get(url).json()

# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
df_bioscience = json_normalize(venues)
df_bioscience.head()

In [None]:
df_bioscience.shape

### 2.5. DNA related venues

In [None]:
search_query = 'DNA'
print(search_query + ' .... OK!')

url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)

results = requests.get(url).json()

# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
df_DNA = json_normalize(venues)
df_DNA.head()

### 2.6. Let's investigate the columns of the dataframes

In [None]:
# Index.equals returns False for indexes of different length, where an element-wise == would raise a ValueError
print('Are all the columns of df_genetics and df_biology the same? ', df_genetics.columns.equals(df_biology.columns))

In [None]:
print('Are all the columns of df_bioscience and df_DNA the same? ', df_bioscience.columns.equals(df_DNA.columns))
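When two dataframes differ in their columns, it is often useful to see which columns are responsible. The following sketch uses two made-up, empty frames; `left` and `right` are illustrative names, not variables from this notebook.

```python
import pandas as pd

# two small made-up frames whose columns differ in number
left = pd.DataFrame(columns=['id', 'name', 'location.lat', 'location.lng'])
right = pd.DataFrame(columns=['id', 'name', 'location.lat', 'location.lng', 'location.city'])

# equals() compares labels and order and does not raise on unequal lengths
same_columns = left.columns.equals(right.columns)
# labels that appear in exactly one of the two frames
only_in_one = left.columns.symmetric_difference(right.columns).tolist()

print('same columns:', same_columns)
print('columns present in only one frame:', only_in_one)
```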

**We see that the dataframes do not all have the same columns. Let's look at their shapes**

In [None]:
print('shape of the dataframe genetics: ', df_genetics.shape)
print('shape of the dataframe biology: ', df_biology.shape)
print('shape of the dataframe bioscience: ', df_bioscience.shape)
print('shape of the dataframe DNA: ', df_DNA.shape)

**Now let's add a label to each dataframe so that we can use this information later on**

In [None]:
df_genetics['label'] = 'genetics'
df_biology['label'] = 'biology'
df_bioscience['label'] = 'bioscience'
df_DNA['label'] = 'DNA'

In [None]:
print('shape of the dataframe genetics: ', df_genetics.shape)
print('shape of the dataframe biology: ', df_biology.shape)
print('shape of the dataframe bioscience: ', df_bioscience.shape)
print('shape of the dataframe DNA: ', df_DNA.shape)

**Let's combine all dataframes and add a new feature called "type" as shown below**

In [None]:
df = pd.concat([df_genetics, df_biology, df_bioscience, df_DNA], ignore_index=True, sort=False)
df['type'] = 'business'
df.shape

In [None]:
df.head()

**Now our dataframe is ready for cleaning, EDA and further analysis**

**But before that, let's visualize these venues**

In [None]:
venues_map = folium.Map(location=[latitude, longitude], zoom_start=7) # generate map centred around Aachen

# add a red circle marker to represent Aachen
folium.vector_layers.CircleMarker(
    [latitude, longitude],
    radius=10,
    color='red',
    popup='Aachen',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(venues_map)

# add the venues that are nearby as blue circle markers
for lat, lng, label in zip(df['location.lat'], df['location.lng'], df['label']):
    folium.vector_layers.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map)

# display map
venues_map

**To see the plot, please refer to the PNG file named "biotechnology venues that are nearby aachen" in my Coursera capstone repository**

## How are we going to analyse the data?

Now we know the locations of the venues near Aachen. We have seen that there is no biotechnology venue within 100 km of Aachen. Moreover, there are no venues south of Aachen. On the other hand, there are 33 venues that are related to my contractor's business.

Therefore, we have to explore these venues further and learn how they are related to our business.

We will investigate why there are no venues south of Aachen.

We will change the search radius to see how the distribution of the venues changes.

We will also investigate possible target markets of our business, such as "genetic diagnostic centers" or "pharmaceutical companies", and their distributions.

And finally, we will recommend some possible locations for their laboratory.

## 3) Biotechnology Related Venues within 50 km of Aachen

In [None]:
# GENETICS
search_query = 'genetic'
radius = 50000
print(search_query + ' .... OK!')
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()
venues = results['response']['venues']
df_genetics_50 = json_normalize(venues)

# BIOLOGY
search_query = 'biology'
print(search_query + ' .... OK!')
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()
venues = results['response']['venues']
df_biology_50 = json_normalize(venues)

# BIOTECHNOLOGY
search_query = 'biotechnology'
print(search_query + ' .... OK!')
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()
venues = results['response']['venues']
df_biotechnology_50 = json_normalize(venues)

# BIOSCIENCE
search_query = 'bioscience'
print(search_query + ' .... OK!')
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()
venues = results['response']['venues']
df_bioscience_50 = json_normalize(venues)

# DNA
search_query = 'DNA'
print(search_query + ' .... OK!')
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()
venues = results['response']['venues']
df_DNA_50 = json_normalize(venues)

df_genetics_50['label'] = 'genetics'
df_biology_50['label'] = 'biology'
df_biotechnology_50['label'] = 'biotechnology'
df_bioscience_50['label'] = 'bioscience'
df_DNA_50['label'] = 'DNA'

df_50 = pd.concat([df_genetics_50, df_biology_50, df_biotechnology_50, df_bioscience_50, df_DNA_50], ignore_index=True, sort=False)

# FOLIUM MAP FOR 50KM
venues_map_50 = folium.Map(location=[latitude, longitude], zoom_start=7) # generate map centred around Aachen
# add a red circle marker to represent Aachen
folium.vector_layers.CircleMarker(
    [latitude, longitude],
    radius=10,
    color='red',
    popup='Aachen',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(venues_map_50)
# add the venues that are nearby as blue circle markers
for lat, lng, label in zip(df_50['location.lat'], df_50['location.lng'], df_50['label']):
    folium.vector_layers.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map_50)
# display map
venues_map_50

**There are only 4 business venues within 50km**

## 4) Target Markets of the Business

### 4.1) Let's search for "genetic diagnostic centers", "pharmaceutical companies", "genetic hospital", "biopharma" and "life science" venues

In [None]:
# genetic diagnostic centers
search_query = 'genetic diagnos'
radius = 100000
print(search_query + ' .... OK!')
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()
venues = results['response']['venues']
df_diagnos = json_normalize(venues)

# pharmaceutical companies
search_query = 'pharmaceutical'
print(search_query + ' .... OK!')
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()
venues = results['response']['venues']
df_pharmaceutical = json_normalize(venues)

# genetic hospital
search_query = 'genetic hospital'
print(search_query + ' .... OK!')
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()
venues = results['response']['venues']
df_hospital = json_normalize(venues)

# biopharma
search_query = 'biopharma'
print(search_query + ' .... OK!')
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()
venues = results['response']['venues']
df_biopharma = json_normalize(venues)

# life science
search_query = 'life science'
print(search_query + ' .... OK!')
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()
venues = results['response']['venues']
df_life_science = json_normalize(venues)

df_diagnos['label'] = 'genetic diagnosis'
df_pharmaceutical['label'] = 'pharmaceutical'
df_hospital['label'] = 'genetic hospital'
df_biopharma['label'] = 'biopharma'
df_life_science['label'] = 'life science'

df_market = pd.concat([df_diagnos, df_pharmaceutical, df_hospital, df_biopharma, df_life_science], ignore_index=True, sort=False)
df_market.head()

In [None]:
df_market['type'] = 'target market'
df_market.head()

In [None]:
print('shape of the target market data: ', df_market.shape)

### 4.2) Display Target Market Venues on the Map

In [None]:
# FOLIUM MAP MARKET
market_map = folium.Map(location=[latitude, longitude], zoom_start=8) # generate map centred around Aachen
# add a red circle marker to represent Aachen
folium.vector_layers.CircleMarker(
    [latitude, longitude],
    radius=10,
    color='red',
    popup='Aachen',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(market_map)

# add the venues that are nearby as green circle markers
for lat, lng, label in zip(df_market['location.lat'], df_market['location.lng'], df_market['label']):
    folium.vector_layers.CircleMarker(
        [lat, lng],
        radius=5,
        color='green',
        popup=label,
        fill = True,
        fill_color='green',
        fill_opacity=0.6
    ).add_to(market_map)
# display map
market_map

### 4.3) Display Business and Target Market Venues on the Same Map

In [None]:
business_market_map = folium.Map(location=[latitude, longitude], zoom_start=7) # generate map centred around Aachen

# add a red circle marker to represent Aachen
folium.vector_layers.CircleMarker(
    [latitude, longitude],
    radius=10,
    color='red',
    popup='Aachen',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(business_market_map)

# add the business venues that are nearby as blue circle markers
for lat, lng, label in zip(df['location.lat'], df['location.lng'], df['label']):
    folium.vector_layers.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(business_market_map)

# add the target market venues that are nearby as green circle markers
for lat, lng, label in zip(df_market['location.lat'], df_market['location.lng'], df_market['label']):
    folium.vector_layers.CircleMarker(
        [lat, lng],
        radius=5,
        color='green',
        popup=label,
        fill = True,
        fill_color='green',
        fill_opacity=0.6
    ).add_to(business_market_map)
    
# display the map
business_market_map

## 5) Let's combine both dataframes

In [None]:
print('shape of the business venues dataframe: ', df.shape)
print('shape of the target market venues dataframe: ', df_market.shape)

In [None]:
df_all = pd.concat([df, df_market], ignore_index=True, sort=False)
print('shape of the business & target market venues dataframe: ', df_all.shape)

## 6) Data Wrangling
0. Change the column names
1. Check for missing data
2. Check for data formats
3. Data Standardization
4. Data Normalization
5. Binning
6. Indicator variable (or dummy variable)

In [None]:
df_all.rename(columns={"location.formattedAddress": "adress", "location.cc": "country_code", "location.distance": "distance (m)", "location.city": "city",
                       "location.country": "country", "location.lat": "latitute", "location.lng": "longitute", "location.postalCode": "postcode", 
                       "location.state": "state"}, inplace = True)
df_all.head()

### 6.1) Let's first check our dataframe for missing data
- 6.1.1 Look at the first n rows (for example 5, 10 or 20)
- 6.1.2 To see which values are missing, we can use the .isnull() method. It returns boolean values: True means that the value is missing, False means that it is present
- 6.1.3 Count the missing values

In [None]:
n = 2
df_all.head(n)

In [None]:
# let's create a boolean dataframe that marks the missing values
missing_data = df_all.isnull()
missing_data.head()

In [None]:
# Count missing values in each column
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")    
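The loop above prints the True/False counts column by column. A more compact view, sketched here on a made-up frame (`toy` is an illustrative name), sums the boolean mask so that only one number per column remains.

```python
import numpy as np
import pandas as pd

# made-up frame with a few gaps
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'city': ['Aachen', np.nan, 'Köln', np.nan],
    'postcode': [np.nan, np.nan, np.nan, '50667'],
})

missing_counts = toy.isnull().sum()  # one count per column
# keep only the columns that actually have gaps, largest count first
missing_only = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_only)
```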

Based on the summary above, each column has 197 rows of data, and seven columns contain missing data.

These are:

- location.address: 63
- city: 43
- postcode: 77
- state: 45
- location.crossStreet: 191
- venuePage.id: 190
- location.neighborhood: 194

<h3 id="deal_missing_values">Deal with missing data</h3>

<b>How to deal with missing data?</b>

<ol>
    <li>drop data<br>
        a. drop the whole row<br>
        b. drop the whole column
    </li>
    <li>replace data<br>
        a. replace it by mean<br>
        b. replace it by frequency<br>
        c. replace it based on other functions
    </li>
</ol>
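The options listed above can be sketched with pandas as follows. The frame `toy` and its column names are made up for illustration and are not the notebook's data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'city': ['Köln', 'Köln', np.nan, 'Bonn'],
    'distance': [60.0, np.nan, 80.0, 70.0],
    'mostly_empty': [np.nan, np.nan, np.nan, 'x'],
})

dropped_rows = toy.dropna(subset=['city'])                    # 1a. drop the whole row
dropped_col = toy.drop(columns=['mostly_empty'])              # 1b. drop the whole column
mean_filled = toy['distance'].fillna(toy['distance'].mean())  # 2a. replace by mean
mode_filled = toy['city'].fillna(toy['city'].mode()[0])       # 2b. replace by frequency

print(mean_filled.tolist())
print(mode_filled.tolist())
```

Dropping is the safer choice when a column is almost empty, while filling keeps more rows at the cost of introducing values that were never observed.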

In [None]:
# all values of "hasPerk" are False, so we can drop this column since it does not give us any information
print('value counts of hasPerk: ', df_all.hasPerk.value_counts())

In [None]:
# the "location.address" and "adress" columns are almost the same, so we drop "location.address"
# we already have longitude and latitude as separate columns, so 'location.labeledLatLngs' is not needed; we drop this column as well

In [None]:
# Let's see the rows where "location.crossStreet" is not null (only 6 of them are not null)
# the information in this column is almost the same as in the adress column, so we can safely drop this column as well
df_all[df_all['location.crossStreet'].notnull()]

In [None]:
# Let's see the rows where "venuePage.id" is not null (only 7 of them are not null)
# We can safely drop also this column
df_all[df_all['venuePage.id'].notnull()]

In [None]:
# Let's see the rows where "location.neighborhood" is not null (only 3 of them are not null)
# We can safely drop also this column
df_all[df_all['location.neighborhood'].notnull()]

In [None]:
df_new = df_all.drop(['hasPerk', 'location.address', 'location.labeledLatLngs', 'location.crossStreet', 'venuePage.id', 'location.neighborhood'], axis = 1)

In [None]:
df_new.head(3)

### Let's investigate the "categories" feature in more detail

In [None]:
df_new.categories[0]

In [None]:
# at rows 20 and 150, the categories feature is an empty list, but it is not null!
print('before')
print('-'*20)
print(df_new.categories[20])
print(df_new.categories[150])

In [None]:
# let's convert them to NaN, which is a better way to represent them

import warnings 
warnings.filterwarnings('ignore')

# placeholder with the same keys as a real category entry
empty_category = [{'id': np.nan,
  'name': np.nan,
  'pluralName': np.nan,
  'shortName': np.nan,
  'icon': {'prefix': np.nan,
   'suffix': np.nan},
  'primary': np.nan}]

# .at writes a single cell directly, which avoids chained assignment
for row in [20, 150]:
    df_new.at[row, 'categories'] = empty_category

# print filled values
print('after')
print('-'*20)
print(df_new.categories[20])
print(df_new.categories[150])

In [None]:
# take the plural name of the first category of each venue (NaN when no category is listed)
df_new['category_name'] = df_new['categories'].apply(
    lambda cats: cats[0]['pluralName'] if cats else np.nan)

In [None]:
df_new = df_new.drop('categories', axis = 1)
df_new.head()

### Let's investigate the "city" feature in more detail

In [None]:
# count the NaN values in the city feature
print('Number of NaN values in city column: ', df_new['city'].isna().sum())
print('The rest is as follows')
df_new.city.value_counts()[0:10]

**How to handle missing data in cities?**

One option is to fill the null values with the most frequent value. Let's look at each country:

1. Germany

We see that in Germany, Köln, Bonn and Düsseldorf are the most frequent cities. Köln lies between the other two and is the most frequent one.

2. Netherlands

We see that in the Netherlands, Eindhoven and Maastricht are the most frequent cities.


3. Belgium

We see that no city stands out as frequent.

**Since no single city clearly dominates the dataframe, we will fill each missing city with a random draw from the known cities of the same country**

**Although the choice is random, a city that appears more often is more likely to be drawn**

**For example, Köln is the most frequent city in Germany, so it will most probably fill more of the missing German cities than any other city**

**The same holds for the Netherlands and Belgium**
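The idea of a frequency-weighted random fill can be checked with a short sketch. The list `known_cities` is made up; because it keeps duplicates, `random.choice` draws each city roughly in proportion to how often it occurs. The seed is fixed here only to make the illustration repeatable.

```python
import random
from collections import Counter

# made-up list of known cities; duplicates are kept on purpose
known_cities = ['Köln', 'Köln', 'Köln', 'Bonn', 'Düsseldorf']

random.seed(0)  # only to make the illustration repeatable
draws = [random.choice(known_cities) for _ in range(10000)]
shares = {city: count / len(draws) for city, count in Counter(draws).items()}
print(shares)
```

One consequence of this approach is that the filled values change from run to run unless a seed is set, so results that depend on the filled cities are not exactly reproducible.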

In [None]:
for country in ['DE', 'NL', 'BE']:
    in_country = df_new['country_code'] == country
    # known cities of this country; duplicates are kept so frequent cities are drawn more often
    cities = df_new.loc[in_country & df_new['city'].notnull(), 'city'].tolist()
    nul_city_rows = df_new.index[in_country & df_new['city'].isnull()].tolist()
    if not cities:
        continue  # nothing to sample from for this country

    for row in nul_city_rows:
        df_new.loc[row, 'city'] = random.choice(cities)

In [None]:
# Now we have 0 NaN values in city feature
print('How many NaN values in city we have now? ', df_new['city'].isna().sum())
print('The rest is as follows')
df_new.city.value_counts()[0:10]

### Let's investigate the "postcode" feature in more detail

In [None]:
print('We have {} null values in postcode'.format(df_new['postcode'].isna().sum()))

Since it does not give us much information and there are lots of null values, we will drop the postcode feature. If we tried to fill these values somehow, we might bias the results.

In [None]:
df_new.drop('postcode', axis = 1, inplace = True)

### Let's investigate the "state" feature in more detail

In [None]:
# if we know the country, it is easy to guess the state, as in Germany
# thus we can fill the null values easily
print('States in Germany')
state_DE = df_new[df_new.country_code == 'DE']['state']
print(state_DE.value_counts())

print()

print('States in Netherlands')
state_NL = df_new[df_new.country_code == 'NL']['state']
print(state_NL.value_counts())

print()

print('States in Belgium')
state_BE = df_new[df_new.country_code == 'BE']['state']
print(state_BE.value_counts())

**We proceed as we did for the city feature:**

In [None]:
for country in ['DE', 'NL', 'BE']:
    in_country = df_new['country_code'] == country
    # known states of this country; duplicates are kept so frequent states are drawn more often
    states = df_new.loc[in_country & df_new['state'].notnull(), 'state'].tolist()
    nul_state_rows = df_new.index[in_country & df_new['state'].isnull()].tolist()
    if not states:
        continue  # nothing to sample from for this country

    for row in nul_state_rows:
        df_new.loc[row, 'state'] = random.choice(states)

In [None]:
# Now we have 0 NaN values in state feature
print('How many NaN values in state we have now? ', df_new['state'].isna().sum())
print('The rest is as follows')
df_new.state.value_counts()[0:10]

### Let's investigate the "referralId" feature in more detail

In [None]:
# check for the feature referralId
print('How many NaN values in referralId we have? ', df_new['referralId'].isna().sum())
print('We have: ')
df_new.referralId.value_counts()

**Let's check whether any null values remain**

In [None]:
df_new.isna().sum()

We have 2 null values in category_name

These are:

In [None]:
df_new[df_new.category_name.isnull()]

We can safely drop these 2 rows

In [None]:
# simply drop whole row with NaN in "category_name" column
df_new.dropna(subset=["category_name"], axis=0, inplace=True)

Let's check again whether any null values remain

In [None]:
df_new.isna().sum()

In [None]:
df_new.head()

**Good! Now, we have obtained the dataset with no missing values.**

### 6.2) Let's check the data formats

<p>The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).</p>

In Pandas, we use 
<p><b>.dtypes</b> to check the data type</p>
<p><b>.astype()</b> to change the data type</p>

In [None]:
df_new.dtypes

### Good!

**All data is in its proper format**

### 6.3) Data Standardization

<p>
Data is usually collected from different agencies with different formats.
(Data Standardization is also a term for a particular type of data normalization, where we subtract the mean and divide by the standard deviation)
</p>
    
<b>What is Standardization?</b>
<p>Standardization is the process of transforming data into a common format which allows the researcher to make the meaningful comparison.
</p>

<b>Example</b>
<p>Transform distance (m) to distance (km):</p>
<p>In our dataset, the distance column is given in m (meters). For convenience, we will convert it into km (kilometers)</p>
<p>We will need to apply <b>data transformation</b> to distance(m) into distance(km)</p>


In [None]:
# Convert distance(m) into distance(km) dividing by 1000
df_new['distance (m)'] = (df_new['distance (m)']/1000).round(1)

# rename the column
df_new.rename(columns = {'distance (m)': 'distance (km)'}, inplace = True)

# check the transformed data 
df_new.head()

**No other features need standardization**

### 6.4) Data Normalization

<b>Why normalization?</b>
<p>Normalization is the process of transforming the values of several variables into a similar range. Typical normalizations include scaling a variable so that its average is 0, scaling it so that its variance is 1, or scaling it so that its values range from 0 to 1.
</p>

<p>Let's normalize the columns "distance (km)", "latitute" and "longitute".</p>
<p><b>Target:</b> we would like to normalize those variables so that their values range from 0 to 1.</p>
<p><b>Approach:</b> use MinMaxScaler</p>
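Min-max scaling maps each value v to (v - min) / (max - min). A minimal pure-Python sketch of that formula follows; the function name `min_max_scale` is illustrative, and the sample list uses the distance range quoted later in this notebook (0.9 to 127.5) plus a made-up midpoint.

```python
def min_max_scale(values):
    """Scale a list of numbers linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # a constant column has no spread; map every value to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

distances_km = [0.9, 64.2, 127.5]  # 64.2 is a made-up midpoint
scaled = min_max_scale(distances_km)
print(scaled)
```

Note that the result depends entirely on the observed minimum and maximum, so a single outlier can compress all other values into a narrow range.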

In [None]:
# scale the 'distance (km)', 'latitute' and 'longitute' columns

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
k = df_new[['distance (km)', 'latitute', 'longitute']]
df_new[['distance (km)', 'latitute', 'longitute']] = scaler.fit_transform(k)

In [None]:
# convert 'city' and 'category_name' into numerical values using LabelEncoder

from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
df_new['city_#'] = LE.fit_transform(df_new['city'])
df_new['category_name_#'] = LE.fit_transform(df_new['category_name'])

In [None]:
# normalize these new columns: Replace original value by (original value)/(maximum value)
df_new['city_#'] = df_new['city_#']/df_new['city_#'].max()
df_new['category_name_#'] = df_new['category_name_#']/df_new['category_name_#'].max()

df_new.head()

### 6.5) Binning

<b>Why binning?</b>
<p>
    Binning is a process of transforming continuous numerical variables into discrete categorical 'bins', for grouped analysis.
</p>

<b>Example: </b>
<p>In our dataset, "distance (km)" is a real-valued variable with 159 unique values; it ranged from 0.9 to 127.5 km before the min-max scaling in 6.4 and lies between 0 and 1 afterwards. What if we only care about whether a venue is close, at a medium distance, or far away (3 types)? Can we rearrange the values into three 'bins' to simplify the analysis? </p>

<p>We will use the Pandas method 'cut' to segment the 'distance (km)' column into 3 bins </p>
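How `cut` assigns values to equal-width bins can be seen on a small made-up series. Four edges from `np.linspace` define three bins, and `include_lowest=True` makes sure the minimum itself falls into the first bin. The distances below are invented for illustration.

```python
import numpy as np
import pandas as pd

toy_km = pd.Series([1.0, 40.0, 50.0, 90.0, 120.0])   # made-up distances
edges = np.linspace(toy_km.min(), toy_km.max(), 4)   # 4 edges -> 3 equal-width bins
names = ['Close', 'Medium', 'Far']

binned = pd.cut(toy_km, edges, labels=names, include_lowest=True)
print(edges)
print(binned.tolist())
```

Equal-width bins do not guarantee equally filled bins; if most venues sit at similar distances, one bin may hold nearly all of them.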



In [None]:
bins = np.linspace(min(df_new["distance (km)"]), max(df_new["distance (km)"]), 4)
group_names = ['Close', 'Medium', 'Far']
df_new['distance'] = pd.cut(df_new['distance (km)'], bins, labels=group_names, include_lowest=True )
df_new.head()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
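# value_counts() sorts by frequency, so reorder the counts to match group_names
plt.bar(group_names, df_new["distance"].value_counts().reindex(group_names))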

# set x/y labels and plot title
plt.xlabel("distance")
plt.ylabel("count")
plt.title("distance bins")
plt.show()

### 6.6) Indicator variable (or dummy variable)

<b>What is an indicator variable?</b>
<p>
    An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning. 
</p>

<b>Why do we use indicator variables?</b>
<p>
    So we can use categorical variables for machine learning algorithms later in this project.
</p>

<b>Example</b>
<p>
    We see the column "country_code" has three unique values, "DE", "NL" or "BE". Most of the algorithms do not understand words, only numbers. To use this attribute in analysis, we convert "country_code" into indicator variables.
</p>

<p>
    We will use the pandas method 'get_dummies' to assign numerical values to the columns shown below. 
</p>
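A small sketch of what `get_dummies` produces for a column like "country_code", using made-up rows. The `prefix` argument is an addition for illustration (the notebook's own cells do not use it); it can help keep column names unique when several features are encoded.

```python
import pandas as pd

# made-up rows, one country code per venue
toy = pd.DataFrame({'country_code': ['DE', 'NL', 'DE', 'BE']})

# one indicator column per distinct value; exactly one is set in each row
dummies = pd.get_dummies(toy['country_code'], prefix='country')
print(dummies)
```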

<p>
    "country_code", "state", "referralId", "label", "type", "distance" 
</p>

In [None]:
dummy_variable_country_code = pd.get_dummies(df_new["country_code"])
dummy_variable_state = pd.get_dummies(df_new["state"])
dummy_variable_referralId = pd.get_dummies(df_new["referralId"])
dummy_variable_label = pd.get_dummies(df_new["label"])
dummy_variable_type = pd.get_dummies(df_new["type"])
dummy_variable_distance = pd.get_dummies(df_new["distance"])

In [None]:
# merge dataframe "df_new" and dummy_variable dataframes
df_dummy = pd.concat([df_new, dummy_variable_country_code, dummy_variable_state, dummy_variable_referralId, dummy_variable_label, dummy_variable_type, dummy_variable_distance], axis = 1)

In [None]:
# drop the original columns
df_dummy.drop(['country_code', 'state', 'referralId', 'label', 'type', 'distance'], axis = 1, inplace = True)
df_dummy.head(3)

In [None]:
# lets also drop some of the unnecessary columns 
df_dummy.drop(['id', 'city', 'category_name', 'country', 'adress', 'name'], axis = 1, inplace = True)

In [None]:
df_dummy.head(2)

**Rename "distance (km)" to "distance_norm"**

In [None]:
df_dummy.rename(columns = {'distance (km)': 'distance_norm'}, inplace = True)
df_dummy.head(2)

## 7) Exploratory Data Analysis
* 7.1 Analyzing Individual Feature Patterns using Visualization
* 7.2 Descriptive Statistical Analysis
* 7.3 Basics of Grouping
* 7.4 Correlation and Causation

## 7.1) Analyzing Individual Feature Patterns using Visualization

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

<h4>How to choose the right visualization method?</h4>
<p>When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.</p>
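One possible way to split the columns by type is the pandas method `select_dtypes`. The sketch below uses a made-up three-row frame with column names borrowed from this notebook. The values are assumptions, not the real venue data.

```python
import pandas as pd

# Hypothetical miniature of df_new; the values are made up for illustration
toy = pd.DataFrame({
    "latitute": [50.78, 50.85, 50.93],
    "country_code": ["DE", "NL", "BE"],
    "distance (km)": [1.2, 30.5, 48.0],
})

# Numerical columns are candidates for scatterplots and correlations
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical here (candidates for boxplots)
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numeric_cols)
print("categorical:", categorical_cols)
```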

For example, the method "corr" calculates the correlation only between variables of type "int64" or "float64", so let us first check the type of each column:

In [None]:
df_new.dtypes

<h2>7.1.1) Continuous numerical variables:</h2> 

<p>Continuous numerical variables are variables that may contain any value within some range. Continuous numerical variables can have the type "int64" or "float64". A great way to visualize these variables is by using scatterplots with fitted lines.</p>

<p>We can use "regplot", which plots the scatterplot plus the fitted regression line for the data.</p>

In [None]:
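# restrict to numerical columns so that corr() also works on recent pandas versions
df_new.select_dtypes(include='number').corr()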

**Let's plot a heat map of the correlations**

In [None]:
sns.heatmap(df_new.select_dtypes(include='number').corr(), annot=True)
plt.show()

**Here we see a correlation between latitude and distance (km)**

* The latitude is the ***north or south*** distance of a point on the earth's surface from the equator

* The longitude describes one of the two coordinates of a place on the earth's surface, namely its position ***east or west*** of a defined north-south line, the prime meridian

In [None]:
# latitude as potential segmenter 
from scipy import stats
a = df_new['latitute']
b = df_new['distance (km)']
print('p values: ', stats.pearsonr(a, b)[1])
sns.regplot(x="latitute", y="distance (km)", data=df_new)
print('Correlation: ', df_new[['latitute', 'distance (km)']].corr().iloc[0,1])
plt.show()


In [None]:
# longitute as a potential segmenter 
c = df_new['longitute']
b = df_new['distance (km)']
print('p values: ', stats.pearsonr(c, b)[1])
sns.regplot(x="longitute", y="distance (km)", data=df_new)
print('Correlation: ', df_new[['longitute', 'distance (km)']].corr().iloc[0,1])
plt.show()

* The plots above show a positive correlation between latitude and distance and a negative correlation between longitude and distance (the p-values in both cases are very low, p < 0.001)
* Latitude: the north or south distance of a point
* We can conclude that the venues are mostly distributed along the North-South line

In [None]:
sns.lmplot(x="latitute", y="distance (km)", hue = 'type', data=df_new)
plt.show()

**We see that the distributions of business and target market venues are almost parallel and run in the N-S direction. In fact, when we carefully examine the map depicted above, they are mostly located to the north of Aachen. Thus, the latitute column will be a good indicator**

<h3>7.1.2) Categorical variables</h3>

<p>These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type "object" or "int64". A good way to visualize categorical variables is by using boxplots.</p>

In [None]:
# let's examine country_code and distance (km)
sns.boxplot(x="country_code", y="distance (km)", data=df_new)
plt.show()

The distance distributions of the three countries overlap significantly, so this plot does not give us much information. We can say, however, that the venues in Germany are more densely concentrated.

In [None]:
# let's examine referralId and distance (km)
# we see that referralId could distinguish a little bit the distance of venues
plt.figure(figsize = (9,5))
sns.boxplot(x="referralId", y="distance (km)", data=df_new)
plt.show()

In [None]:
# let's examine label and distance (km)
# we see that label may distinguish the distance of venues
plt.figure(figsize = (15,7))
sns.boxplot(x="label", y="distance (km)", data=df_new)
plt.show()

In [None]:
# let's examine type and distance (km)
# we see that type may distinguish the distance of venues
# business venues are a little farther from Aachen as compared to the target market
sns.boxplot(x="type", y="distance (km)", data=df_new)
plt.show()

### 7.2) Descriptive Statistical Analysis

<p>Let's first take a look at the variables by utilizing a description method.</p>

<p>The <b>describe</b> function automatically computes basic statistics for all continuous variables. Any NaN values are automatically skipped in these statistics.</p>

This will show:
<ul>
    <li>the count of that variable</li>
    <li>the mean</li>
    <li>the standard deviation (std)</li> 
    <li>the minimum value</li>
    <li>the quartiles (25%, 50% and 75%), from which the IQR (interquartile range) can be derived</li>
    <li>the maximum value</li>
</ul>
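As a small illustration of these statistics, the sketch below runs `describe` on five made-up distances and derives the interquartile range from the 25% and 75% entries. The series `distances` is an assumption for illustration, not part of `df_new`.

```python
import pandas as pd

# Made-up distances in km, used only to illustrate the summary statistics
distances = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0], name="distance (km)")

summary = distances.describe()

# The interquartile range is the spread between the 75% and 25% quartiles
iqr = summary["75%"] - summary["25%"]

print(summary)
print("IQR:", iqr)
```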


In [None]:
df_new.describe()

In [None]:
# We see that in the category_name column there are outliers
df_new.boxplot()
plt.show()

**Value Counts**

In [None]:
# referralId
# to_frame(name=...) names the count column directly, independent of the pandas version
df_referralId = df_new.referralId.value_counts().to_frame(name='Count')
df_referralId.index.name = 'referralId'
df_referralId

In [None]:
# referralId bar plot
df_referralId = df_new.referralId.value_counts().to_frame(name='Count')
sns.barplot(x=df_referralId.index, y="Count", data=df_referralId)
plt.xlabel('referralId')
plt.ylabel('Count')
plt.show()

In [None]:
# country_code
df_country_code = df_new.country_code.value_counts().to_frame(name='Count')
sns.barplot(x=df_country_code.index, y="Count", data=df_country_code)
plt.xlabel('Country')
plt.ylabel('Count')
plt.show()

In [None]:
# city
df_city = df_new.city.value_counts().to_frame(name='Count')
plt.figure(figsize = (18,8))
sns.barplot(x=df_city.index, y="Count", data=df_city)
plt.xticks(rotation = 90)
plt.ylabel('Count')
plt.show()

In [None]:
# label
df_label = df_new.label.value_counts().to_frame(name='Count')
plt.figure(figsize = (18,8))
sns.barplot(x=df_label.index, y="Count", data=df_label)
plt.xticks(rotation = 90)
plt.ylabel('Count')
plt.show()

In [None]:
# type
df_type = df_new.type.value_counts().to_frame(name='Count')
sns.barplot(x=df_type.index, y="Count", data=df_type)
plt.xticks(rotation = 90)
plt.ylabel('Count')
plt.show()

### 7.3) Grouping

<p>The "groupby" method groups data by different categories. The data is grouped based on one or several variables and analysis is performed on the individual groups.</p>

### 7.3.1) Group by country & type and observe the distance

In [None]:
# grouping results
df_group = df_new[['referralId','country_code', 'city', 'label', 'type', 'distance (km)']]
# numeric_only=True skips the text columns, which cannot be averaged
grouped_test1 = df_group.groupby(['country_code','type'],as_index=False).mean(numeric_only=True)
grouped_test1

<p>This grouped data is much easier to visualize when it is made into a pivot table. A pivot table is like an Excel spreadsheet, with one variable along the columns and another along the rows. We can convert the grouped dataframe to a pivot table using the method "pivot".</p>

<p>In this case, we will leave the "country_code" variable as the rows of the table, and pivot "type" to construct the columns of the table:</p>
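The two steps, grouping and pivoting, can be sketched on a tiny made-up frame. The rows below are assumptions that only mimic the column names used here, and the `values` argument is passed explicitly so that the result has plain column labels.

```python
import pandas as pd

# Made-up rows that mimic the columns used here; not the real venue data
toy = pd.DataFrame({
    "country_code": ["DE", "DE", "NL", "NL", "BE", "BE"],
    "type": ["business", "target market"] * 3,
    "distance (km)": [10.0, 12.0, 30.0, 20.0, 45.0, 15.0],
})

# Mean distance per (country_code, type) pair
grouped = toy.groupby(["country_code", "type"], as_index=False)["distance (km)"].mean()

# Rows: country_code, columns: type, cells: mean distance
pivoted = grouped.pivot(index="country_code", columns="type", values="distance (km)")
print(pivoted)
```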

In [None]:
grouped_pivot = grouped_test1.pivot(index='country_code', columns='type')
grouped_pivot

Let's use a heat map to visualize how the mean distance varies with country_code and type.

In [None]:
#use the grouped results
fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='RdBu')

#label names
row_labels = grouped_pivot.columns.levels[1]
col_labels = grouped_pivot.index

#move ticks and labels to the center
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

#rotate label if too long
plt.xticks(rotation=0)

fig.colorbar(im)
plt.show()

* Target market venues in Belgium are the closest ones to Aachen. On the other hand, business venues in Belgium are the farthest
* In Germany, the business and target market venues are at almost the same distance
* In the Netherlands, target market venues are closer than business venues

**It might make sense to open a business in Belgium, close to both the Netherlands and Aachen. But we need further analysis**

### 7.3.2) Group by country & label and observe the distance

In [None]:
grouped_test2 = df_group.groupby(['country_code','label'],as_index=False).mean(numeric_only=True)
grouped_pivot = grouped_test2.pivot(index='country_code', columns='label')

#use the grouped results
fig, ax = plt.subplots(figsize=(20, 5))
im = ax.pcolor(grouped_pivot, cmap='RdBu')

#label names
row_labels = grouped_pivot.columns.levels[1]
col_labels = grouped_pivot.index

#move ticks and labels to the center
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

#rotate label if too long
plt.xticks(rotation=0)

fig.colorbar(im)
plt.show()

Genetic hospitals and genetic diagnosis venues in Belgium and life science venues in the Netherlands are close to Aachen. These results can strongly affect our decision.

In [None]:
# We see that in the category_name_# column there are outliers in the target market
df_new.boxplot(by='type', figsize=(15,5), layout=(1,5))
plt.show()

### 7.4) Correlation and Causation

<p><b>Correlation</b>: a measure of the extent of interdependence between variables.</p>

<p><b>Causation</b>: the relationship between cause and effect between two variables.</p>

<p>It is important to know the difference between these two and that correlation does not imply causation. Determining correlation is much simpler  than determining causation as causation may require independent experimentation.</p>

<p><b>Pearson Correlation</b></p>
<p>The Pearson Correlation measures the linear dependence between two variables X and Y.</p>
<p>The resulting coefficient is a value between -1 and 1 inclusive, where:</p>
<ul>
    <li><b>1</b>: Total positive linear correlation.</li>
    <li><b>0</b>: No linear correlation. The two variables may still be related in a non-linear way.</li>
    <li><b>-1</b>: Total negative linear correlation.</li>
</ul>
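The coefficient can also be computed by hand as the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The helper `pearson_r` below is a hypothetical name for this sketch, and the small arrays are made-up data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations of x from its mean
    dy = y - y.mean()   # deviations of y from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = pearson_r(x, [2.0, 4.0, 6.0, 8.0, 10.0])   # perfectly increasing line
r_down = pearson_r(x, [5.0, 4.0, 3.0, 2.0, 1.0])  # perfectly decreasing line
r_vee = pearson_r(x, [2.0, 1.0, 0.0, 1.0, 2.0])   # symmetric V shape, no linear trend

print(r_up, r_down, r_vee)
```

The V-shaped example is clearly structured, yet its coefficient is zero, because the Pearson correlation only measures linear dependence.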

<p>Pearson Correlation is the default method of the function "corr". As before, we can calculate the Pearson Correlation of the 'int64' or 'float64' variables.</p>

In [None]:
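# restrict to numerical columns so that corr() also works on recent pandas versions
df_new.select_dtypes(include='number').corr()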

Sometimes we would like to know the significance of the correlation estimate. How?

<b>P-value</b>:
<p>What is this P-value? The P-value is the probability of observing a correlation at least as strong as the measured one if the two variables were in fact uncorrelated. A small P-value therefore suggests that the observed correlation is unlikely to be due to chance alone. Normally, we choose a significance level of 0.05, which means that we call a correlation statistically significant when its P-value is below 0.05.</p>

By convention, when
<ul>
    <li>the p-value is $<$ 0.001: we say there is strong evidence that the correlation is significant.</li>
    <li>the p-value is $<$ 0.05: there is moderate evidence that the correlation is significant.</li>
    <li>the p-value is $<$ 0.1: there is weak evidence that the correlation is significant.</li>
    <li>the p-value is $>$ 0.1: there is no evidence that the correlation is significant.</li>
</ul>
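This convention can be written as a small helper. The function name `evidence_label` is hypothetical, and the treatment of a p-value of exactly 0.1, which the list leaves open, is an assumption of this sketch.

```python
def evidence_label(p_value):
    """Map a p-value to the verbal convention listed above."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    # Assumption: a p-value of exactly 0.1 is grouped with "no evidence"
    return "none"

for p in (0.0004, 0.03, 0.08, 0.4):
    print(p, "->", evidence_label(p))
```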

In [None]:
from scipy import stats

**7.4.1) Latitude vs Distance**

In [None]:
pearson_coef, p_value = stats.pearsonr(df_new['latitute'], df_new['distance (km)'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value) 
print('p value is smaller than 0.001: ', p_value < 0.001)

A p-value below 0.001 implies that the correlation is statistically significant, although the linear relationship is not extremely strong, with a correlation coefficient of 0.5216026202606111.

**7.4.2) Longitude vs Distance**

In [None]:
pearson_coef, p_value = stats.pearsonr(df_new['longitute'], df_new['distance (km)'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value) 
print('p value is smaller than 0.001: ', p_value < 0.001)

A p-value below 0.001 implies that the correlation is statistically significant, although the linear relationship is not strong, with a correlation coefficient of -0.2875019558678858.

(In other words, the relationship is weak, but we can trust it since it has a very small p-value.)

The distance of venues from Aachen increases with latitude (going north) and slightly decreases with longitude (going east).

**7.4.3) City_# vs Distance**

In [None]:
pearson_coef, p_value = stats.pearsonr(df_new['city_#'], df_new['distance (km)'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value) 
print('p value is higher than 0.1: ', p_value > 0.1)

There is no significant linear relationship between city_# and distance.

### 7.5) ANOVA: Analysis of Variance

<p>The Analysis of Variance  (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:</p>

<p><b>F-test score</b>: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.</p>

<p><b>P-value</b>: The P-value tells us how statistically significant the calculated F-test score is.</p>

<p>If a variable is strongly correlated with another variable that we are analyzing, expect ANOVA to return a sizeable F-test score and a small p-value.</p>

<p>Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.</p>
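To see where the F-test score comes from, the sketch below computes it by hand as the ratio of the between-group variance to the within-group variance and compares it with `scipy.stats.f_oneway`. The helper `f_statistic` is a hypothetical name and the three groups are made-up numbers.

```python
import numpy as np
from scipy import stats

def f_statistic(*groups):
    """One-way ANOVA F score: between-group variance over within-group variance."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)        # number of groups
    n = all_values.size    # total number of observations
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up distances for three groups; the second group is clearly farther away
a = [10.0, 12.0, 11.0, 13.0]
b = [20.0, 22.0, 21.0, 23.0]
c = [11.0, 12.0, 10.0, 13.0]

f_manual = f_statistic(a, b, c)
f_scipy, p_scipy = stats.f_oneway(a, b, c)
print("manual F:", f_manual, "scipy F:", f_scipy, "P:", p_scipy)
```

Because one group mean lies far from the others while the spread inside each group is small, the F score is large and the p-value is small.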

### 7.5.1) referralId vs distance

In [None]:
df_categorical = df_new[['referralId', 'country_code', 'type', 'label', 'distance (km)']]
df_categorical_grouped1 = df_categorical[['referralId', 'distance (km)']].groupby(['referralId']) 
df_categorical_grouped1.head(2)

In [None]:
# referralId values, ordered from most to least frequent
refID = list(df_categorical.referralId.value_counts().index)
num_of_refID = len(refID)

# ANOVA of all groups (works for any number of referralId values)
f_val, p_val = stats.f_oneway(*[df_categorical_grouped1.get_group(r)['distance (km)'] for r in refID])
print("ANOVA results for all groups in referralId: F=", f_val, ", P =", p_val)

for i in range (0, num_of_refID):
    for j in range (i+1, num_of_refID):
        f_val, p_val = stats.f_oneway(df_categorical_grouped1.get_group(refID[i])['distance (km)'], df_categorical_grouped1.get_group(refID[j])['distance (km)'])  
        print("ANOVA results for referralId {0} and {1}: F= {2}, P = {3}".format(refID[i], refID[j], f_val, p_val))

For pairs with a high p-value (> 0.1) and a low F score, we cannot claim a statistically significant difference.
For the other referralId pairs, the difference is statistically significant (the p-values are low (< 0.001) and the F scores are high).
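The pairwise comparison can also be sketched in a more general form with `itertools.combinations`. The helper `pairwise_anova` and the referralId values "r1", "r2" and "r3" are hypothetical, and the distances are made-up numbers.

```python
from itertools import combinations

import pandas as pd
from scipy import stats

def pairwise_anova(df, group_col, value_col):
    """Run a one-way ANOVA for every pair of groups in group_col."""
    samples = {name: g[value_col].to_numpy() for name, g in df.groupby(group_col)}
    results = {}
    for first, second in combinations(samples, 2):
        f_val, p_val = stats.f_oneway(samples[first], samples[second])
        results[(first, second)] = (f_val, p_val)
    return results

# Made-up data: r1 and r3 have the same mean distance, r2 is much farther away
toy = pd.DataFrame({
    "referralId": ["r1"] * 4 + ["r2"] * 4 + ["r3"] * 4,
    "distance (km)": [10.0, 12.0, 11.0, 13.0,
                      40.0, 42.0, 41.0, 43.0,
                      11.0, 12.0, 10.0, 13.0],
})

pair_results = pairwise_anova(toy, "referralId", "distance (km)")
for pair, (f_val, p_val) in pair_results.items():
    print(pair, "F =", f_val, "P =", p_val)
```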

### 7.5.2) Country vs Distance

In [None]:
df_categorical_grouped2 = df_categorical[['country_code', 'distance (km)']].groupby(['country_code']) 
df_categorical_grouped2.head(2)

In [None]:
# ANOVA of all groups
f_val, p_val = stats.f_oneway(df_categorical_grouped2.get_group('NL')['distance (km)'], df_categorical_grouped2.get_group('BE')['distance (km)'], df_categorical_grouped2.get_group('DE')['distance (km)'])  
print( "ANOVA results for all groups in country_code: F=", f_val, ", P =", p_val)

# ANOVA of group with country_code NL and BE 
f_val, p_val = stats.f_oneway(df_categorical_grouped2.get_group('NL')['distance (km)'], df_categorical_grouped2.get_group('BE')['distance (km)'])  
print( "ANOVA results for NL and BE: F=", f_val, ", P =", p_val) 

# ANOVA of group with country_code NL and DE 
f_val, p_val = stats.f_oneway(df_categorical_grouped2.get_group('NL')['distance (km)'], df_categorical_grouped2.get_group('DE')['distance (km)'])  
print( "ANOVA results for NL and DE: F=", f_val, ", P =", p_val)   

# ANOVA of group with country_code BE and DE 
f_val, p_val = stats.f_oneway(df_categorical_grouped2.get_group('BE')['distance (km)'], df_categorical_grouped2.get_group('DE')['distance (km)'])  
print( "ANOVA results for BE and DE: F=", f_val, ", P =", p_val)

BE and DE differ slightly, and the difference may be statistically significant. There is no significant difference between NL and BE.

### 7.5.3) Type vs Distance

In [None]:
df_categorical_grouped3 = df_categorical[['type', 'distance (km)']].groupby(['type']) 
df_categorical_grouped3.head(2)

In [None]:
# ANOVA of all groups
f_val, p_val = stats.f_oneway(df_categorical_grouped3.get_group('business')['distance (km)'], df_categorical_grouped3.get_group('target market')['distance (km)'])  
print( "ANOVA results for all groups in type: F=", f_val, ", P =", p_val)

We can say that business and target market venues form different groups with respect to distance.

### 8) Model Development

We want to predict a possible location for the venue to be set up. To achieve this goal, we need to predict latitude and longitude, which are continuous variables. Because the targets are continuous values rather than class labels, classification algorithms do not apply. For continuous variables we can choose regression models such as linear regression and polynomial regression. We are also going to build an unsupervised learning model (k-means clustering) to look for any reasonable relationship.

In [None]:
df_dummy.head(2)

In [None]:
x_lat = df_dummy.drop(['latitute'], axis = 1)
y_lat = df_dummy['latitute']

x_lon = df_dummy.drop(['longitute'], axis = 1)
y_lon = df_dummy['longitute']

from sklearn.model_selection import train_test_split
x_train_lat, x_test_lat, y_train_lat, y_test_lat = train_test_split(x_lat, y_lat, test_size=0.15, random_state=1)
x_train_lon, x_test_lon, y_train_lon, y_test_lon = train_test_split(x_lon, y_lon, test_size=0.15, random_state=1)

print("number of test samples :", x_test_lat.shape[0])
print("number of training samples:",x_train_lat.shape[0])

### 8.1) Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
lr_lat = LinearRegression()
lr_lat.fit(x_train_lat, y_train_lat)

print('Linear Regression for Latitute Prediction')
print('train data prediction score: ',lr_lat.score(x_train_lat, y_train_lat))
print('test data prediction score: ',lr_lat.score(x_test_lat, y_test_lat))

lr_lon = LinearRegression()
lr_lon.fit(x_train_lon, y_train_lon)

print()
print('Linear Regression for Longitute Prediction')
print('train data prediction score: ',lr_lon.score(x_train_lon, y_train_lon))
print('test data prediction score: ',lr_lon.score(x_test_lon, y_test_lon))

The prediction score on the training data is higher than on the test data, which is the expected result. Nevertheless, the results show that the model makes good predictions.

**Cross-validation Score**

In [None]:
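# The parameter 'cv' determines the number of folds.
# We divide the data into cv = 4 parts and use 1 part for testing and 3 parts for training.
# Then we rotate the held-out part and repeat the procedure until every part has been used for testing.
# Note: abs() below hides the sign of any negative R^2 score (a fit worse than predicting the mean).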
from sklearn.model_selection import cross_val_score
Rcross_lat = abs(cross_val_score(lr_lat, x_lat, y_lat, cv=4))
Rcross_lon = abs(cross_val_score(lr_lon, x_lon, y_lon, cv=4))

print('Cross validation scores for latitute: ', Rcross_lat)
print('Cross validation scores for longitute: ', Rcross_lon)
print("")
print("Latitute: The mean of the folds: ", Rcross_lat.mean(), "and the standard deviation: " , Rcross_lat.std())
print("Longitute: The mean of the folds: ", Rcross_lon.mean(), "and the standard deviation: " , Rcross_lon.std())

- We see that when the data are split into different folds, the prediction scores for both latitude and longitude decrease slightly.
- For longitude, however, the model is more reliable: the mean score is higher and the standard deviation is lower.
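To show what `cross_val_score` does internally with `cv=4`, the sketch below repeats the procedure by hand with `KFold` on synthetic data and compares the two results. The data, the coefficients and the noise level are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (assumed data, not the venue dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=40)

model = LinearRegression()
kfold = KFold(n_splits=4)  # unshuffled folds, as used for a regressor when cv=4

manual_scores = []
for train_idx, test_idx in kfold.split(X):
    model.fit(X[train_idx], y[train_idx])                        # train on 3 parts
    manual_scores.append(model.score(X[test_idx], y[test_idx]))  # R^2 on the held-out part

library_scores = cross_val_score(LinearRegression(), X, y, cv=4)

print("manual :", np.round(manual_scores, 4))
print("library:", np.round(library_scores, 4))
```

Each score is the R^2 of one held-out fold, so the mean and the standard deviation over the folds indicate how stable the model is.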

### 8.2) Polynomial Regression and Pipelines

<p>Data Pipelines simplify the steps of processing the data. We use the module <b>Pipeline</b> to create a pipeline. We also use <b>StandardScaler</b> as a step in our pipeline.</p>

We create the pipeline, by creating a list of tuples including the name of the model or estimator and its corresponding constructor.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

for i in range (1, 5):
    Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(degree=i)), ('model',LinearRegression())]
    pipe=Pipeline(Input)

    pipe_lat = pipe.fit(x_train_lat, y_train_lat)
    y_train_lat_hat = pipe_lat.predict(x_train_lat)
    y_test_lat_hat = pipe_lat.predict(x_test_lat)

    pipe_lon = pipe.fit(x_train_lon, y_train_lon)
    y_train_lon_hat = pipe_lon.predict(x_train_lon)
    y_test_lon_hat = pipe_lon.predict(x_test_lon)
    
    print('Degree: ', i)
    print('')
    print('Polynomial Regression for Latitute Prediction')
    print('train data prediction score: ',r2_score(y_train_lat, y_train_lat_hat))
    print('test data prediction score: ',r2_score(y_test_lat, y_test_lat_hat))

    print()
    print('Polynomial Regression for Longitute Prediction')
    print('train data prediction score: ',r2_score(y_train_lon, y_train_lon_hat))
    print('test data prediction score: ',r2_score(y_test_lon, y_test_lon_hat))
    print()

**For degree = 1, the prediction scores are good. For degree > 1, however, the prediction score on the training data is very high while the score on the test data is very low. This implies that the model is overfitted and does not generalize for degree > 1**
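One likely reason for the overfitting is how quickly the number of polynomial terms grows with the degree. The sketch below counts the terms for a hypothetical input with 10 columns. The real `df_dummy` has a different number of columns, so the figures are only indicative.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 10                    # hypothetical number of input columns
X = np.zeros((3, n_features))      # the values do not matter for counting terms

feature_counts = {}
for degree in range(1, 5):
    poly = PolynomialFeatures(degree=degree).fit(X)
    feature_counts[degree] = int(poly.n_output_features_)
    # With the bias term included, the count equals C(n_features + degree, degree)
    print("degree", degree, "->", feature_counts[degree], "polynomial terms,",
          "formula:", comb(n_features + degree, degree))
```

When the number of terms approaches or exceeds the number of training rows, a linear model can reproduce the training data almost exactly without learning a pattern that carries over to the test data.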

### 8.3) K-Means Clustering

The KMeans class has many parameters that can be used, but we will be using these three:
<ul>
    <li> <b>init</b>: Initialization method of the centroids. </li>
    <ul>
        <li> Value will be: "k-means++" </li>
        <li> k-means++: Selects initial cluster centers for k-means clustering in a smart way to speed up convergence.</li>
    </ul>
    <li> <b>n_clusters</b>: The number of clusters to form as well as the number of centroids to generate. </li>
    <li> <b>n_init</b>: Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia. </li>
    <ul> <li> Value will be: 12 </li> </ul>
</ul>
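Before applying these parameters to our data, here is a minimal sketch on two made-up, well-separated point clouds. The blobs, the name `toy_kmeans` and the fixed `random_state` are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs (assumed data, not the venue dataset)
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.3, size=(20, 2))
points = np.vstack([blob_a, blob_b])

# Same three parameters as described above; random_state only makes the run repeatable
toy_kmeans = KMeans(init="k-means++", n_clusters=2, n_init=12, random_state=0)
toy_kmeans.fit(points)

print("labels:", toy_kmeans.labels_)
print("centers:", toy_kmeans.cluster_centers_)
print("inertia:", toy_kmeans.inertia_)
```

The attribute `labels_` holds one cluster index per row, and `inertia_` is the sum of squared distances to the nearest center, the quantity that `n_init` uses to pick the best run.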

Initialize KMeans with these parameters, where the resulting object is called <b>k_means</b>.

In [None]:
# we want to cluster the data into 4 groups
X = df_dummy.values
from sklearn.cluster import KMeans 
k_means = KMeans(init = "k-means++", n_clusters = 4, n_init = 12)
k_means.fit(X)
k_means_labels = k_means.labels_

In [None]:
df_kmeans = pd.concat([df_all, pd.DataFrame(k_means_labels, columns = ['kmeans_label'])], axis = 1)
df_kmeans.head(2)

In [None]:
map_kmeans = folium.Map(location=[latitude, longitude], zoom_start=7) # generate map centred around Aachen

# add the venues as circle markers, colored by their k-means cluster label
for lat, lng, label in zip(df_kmeans['latitute'], df_kmeans['longitute'], df_kmeans['kmeans_label']):
    
    if label == 0.0:
        label_color = 'red'
    elif label == 1.0:
        label_color = 'green'
    elif label == 2.0:
        label_color = 'blue'
    else:
        label_color = 'purple'
    
    folium.vector_layers.CircleMarker(
        [lat, lng],
        radius=5,
        color=label_color,
        popup=label,
        fill = True,
        fill_color=label_color,
        fill_opacity=0.6
    ).add_to(map_kmeans)
    
map_kmeans

In [None]:
df_kmeans_0 = df_kmeans[df_kmeans.kmeans_label == 0.0][['name', 'country_code', 'city', 'label', 'type']]
df_kmeans_1 = df_kmeans[df_kmeans.kmeans_label == 1.0][['name', 'country_code', 'city', 'label', 'type']]
df_kmeans_2 = df_kmeans[df_kmeans.kmeans_label == 2.0][['name', 'country_code', 'city', 'label', 'type']]
df_kmeans_3 = df_kmeans[df_kmeans.kmeans_label == 3.0][['name', 'country_code', 'city', 'label', 'type']]

In [None]:
df_kmeans_0

In [None]:
df_kmeans_1

In [None]:
df_kmeans_2

In [None]:
df_kmeans_3

### 9. Results
- There are only 4 business venues within 50 km
- The plots show a positive correlation between latitude and distance and a negative correlation between longitude and distance (the p-values in both cases are very low, p < 0.001)
- The distance distributions of the three countries overlap significantly, so they do not give us much information. We can say, however, that the venues in Germany are more densely concentrated
- Target market venues in Belgium are the closest ones to Aachen. On the other hand, business venues in Belgium are the farthest
- In Germany, the business and target market venues are at almost the same distance
- In the Netherlands, target market venues are closer than business venues
- BE and DE differ slightly, and the difference may be statistically significant. There is no significant difference between NL and BE
- Business and target market venues form different groups with respect to distance

### 10. Discussion

- We see that the distributions of business and target market venues are almost parallel and run in the N-S direction. In fact, when we carefully examine the map depicted above, they are mostly located to the north of Aachen. Thus, the latitute column will be a good indicator. Moreover, business venues are farther away than target market venues
- Genetic hospitals and genetic diagnosis venues in Belgium and life science venues in the Netherlands are close to Aachen. If these venues are the target, we should take these results into account
- We want to set up a business. The results show that the venue should be in one of the df_kmeans_# groups. Interestingly, almost all business venues are in the same group. This shows that business venues are smartly distributed, and we should also consider this when choosing the location

### 11. Conclusion
- We can conclude that the venues are mostly distributed along the North-South line. Distance has a positive correlation with latitude and a negative correlation with longitude.
- It might make sense to open a business in Belgium, close to both the Netherlands and Aachen, specifically if the targets are genetic hospitals, genetic diagnosis centers and life science venues.
- The venues in Germany are more densely concentrated. This suggests that investing in Germany may not be a good choice, especially for a startup company.