# From Emails

### 1. Briefly introduce your topic - why should we care about the topic? What datasets you will use + where you'll get these? Possible methodology? 

**Topic**
An analysis of the characteristics of companies that have relocated within the UK tech industry.
Characteristics might include:
- Employee size
- Location
- Company operating years

**Reason**
For Government: 
- Attract talented people to work in the UK
- Improve employment
    
For Companies: 
- Inform founders' choice of where to start a company
- Optimise recruitment
     
**Method**

ML: DBSCAN or K-means? -> inspect the characteristics of moved companies

Output: Insights or Recommendation for policy
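
A minimal sketch of how the two candidate clustering methods could be compared. The feature table here is hypothetical toy data (the real feature set is not defined yet), and `employees`, `operating_years`, `n_clusters=2`, `eps=0.5` and `min_samples=5` are illustrative assumptions, not tuned choices.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two loose groups of firms (small/young vs large/old).
rng = np.random.default_rng(0)
firms = pd.DataFrame({
    "employees": np.concatenate([rng.normal(10, 2, 50), rng.normal(200, 20, 50)]),
    "operating_years": np.concatenate([rng.normal(3, 1, 50), rng.normal(25, 3, 50)]),
})

# Both methods are distance-based, so put the features on a common scale first.
X = StandardScaler().fit_transform(firms[["employees", "operating_years"]])

# K-means needs the number of clusters up front (assumed to be 2 here).
firms["kmeans_cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN finds the number of clusters itself and labels noise points as -1.
firms["dbscan_cluster"] = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Profile each K-means cluster by its average characteristics.
profile = firms.groupby("kmeans_cluster")[["employees", "operating_years"]].mean()
print(profile)
```

K-means forces every firm into a cluster, while DBSCAN can leave unusual firms unassigned, which may matter if moved companies turn out to be outliers.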

### 2. Think about what you want to do with the dissertation afterwards - is it something to show future employers, do you want to do further academic research? 

- Work with big data -> improve data-processing skills

- Visualisation: origin/destination map of moved firms -> show to employers -> provide business insights

- Further research: Decision Tree/Random Forest -> predict which companies may move

### 3. Ethical reflection - have a look at the ethics lecture / guidance and draft a couple of sentences. Will your project need formal ethical approval or not?

N/A

# Data Overview

* FAME_OC: master data, all variables [~15 GB]
* OC_1.1: extract of firms with true location info (trading addresses) [<1 GB]
* OC_2.1: extract of firms with different trading addresses and registered addresses (may be trading address or home/other location) [<1 GB]
* OC_3.1: extract of firms with only registered addresses [~10 GB]
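
The larger files (~10 to 15 GB) may not fit in memory in one go. A possible approach, sketched below, is to stream the `.dta` file in chunks with `pd.read_stata(..., chunksize=...)` and load only the columns needed. The helper name `count_firms_by_birth_year` and the tiny stand-in file are assumptions for illustration; the real paths and chunk size would differ.

```python
import os
import tempfile

import pandas as pd


def count_firms_by_birth_year(path, chunksize=100_000):
    """Stream a Stata file chunk by chunk and count firms per birth_year."""
    parts = []
    # With chunksize set, read_stata returns an iterator instead of a DataFrame.
    with pd.read_stata(path, columns=["birth_year"], chunksize=chunksize) as reader:
        for chunk in reader:
            parts.append(chunk["birth_year"].value_counts())
    # Merge the per-chunk counts into one total per year.
    return pd.concat(parts).groupby(level=0).sum().sort_index()


# Small stand-in file so the sketch runs without the real data.
demo = pd.DataFrame({
    "birth_year": [1856.0, 2004.0, 2004.0, 2011.0],
    "sic4": [4611.0, 7499.0, 7499.0, 6810.0],
})
tmp_path = os.path.join(tempfile.mkdtemp(), "demo.dta")
demo.to_stata(tmp_path, write_index=False)

year_counts = count_firms_by_birth_year(tmp_path, chunksize=2)
print(year_counts)
```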

---

In [1]:
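# Load the OC_1.1 extract (329MB Stata file)
import pandas as pd
import sys

# first entry of the module search path (usually the working directory)
PATH = sys.path[0]

# absolute path to the Stata file on the local machine
f = r'/Users/fangzeqiang/Desktop/Dissertation/OC_1.1.dta'
OC_1_1 = pd.read_stata(f)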

# EDA

In [2]:
OC_1_1.head()

Unnamed: 0,registered_number,bvd_id,id,registered_addresspostal_code,primaryaddresspostcodeifuk,alltradingaddressespostcodeifuk,sic4,birth_year,diss_year,streetnobuildingetcline1,...,country,countryisocode,regionincountry,typeofregionincountry,telephonenumber,faxnumber,addresstype,address_group,isdup_t_postcode,_merge
0,118,,233927.0,TN23 1DA,,,4611.0,1856.0,,The New Ashford Market,...,United Kingdom,GB,England|South Eastern|Tonbridge (TN)|Ashford,Country|Region|Postal area|Town,1233506201.0,,Trading address,1.0,0,matched (3)
1,258,,12022629.0,RG12 1AN,,,7499.0,1856.0,,India Buildings,...,United Kingdom,GB,England|North West|Liverpool (L)|Liverpool,Country|Region|Postal area|Town,,,Trading address,1.0,0,matched (3)
2,371,,2111477.0,SW1Y 6BN,,,7499.0,1863.0,,St Clements House,...,United Kingdom,GB,England|Eastern|Norwich (NR)|Norwich,Country|Region|Postal area|Town,,,Trading address,1.0,0,matched (3)
3,402,,2302906.0,PL4 0RA,,,7499.0,1863.0,,Millbay Road,...,United Kingdom,GB,England|South Western|Plymouth (PL)|Plymouth,Country|Region|Postal area|Town,1752275850.0,,Trading address,1.0,0,matched (3)
4,425,,3310625.0,EC4Y 8BB,,,,1856.0,2014.0,Surrey House,...,United Kingdom,GB,England|London Outer|Kingston Upon Thames (KT)...,Country|Region|Postal area|Town,,,Trading address,1.0,0,matched (3)


### Inspect all column names

In [3]:
OC_1_1.columns

Index(['registered_number', 'bvd_id', 'id', 'registered_addresspostal_code',
       'primaryaddresspostcodeifuk', 'alltradingaddressespostcodeifuk', 'sic4',
       'birth_year', 'diss_year', 'streetnobuildingetcline1',
       'streetnobuildingetcline1native', 'streetnobuildingetcline2',
       'streetnobuildingetcline2native', 'streetnobuildingetcline3',
       'streetnobuildingetcline3native', 'streetnobuildingetcline4',
       'streetnobuildingetcline4native', 'postcode', 'city', 'citynative',
       'country', 'countryisocode', 'regionincountry', 'typeofregionincountry',
       'telephonenumber', 'faxnumber', 'addresstype', 'address_group',
       'isdup_t_postcode', '_merge'],
      dtype='object')

### Inspect the numeric columns

In [4]:
df = OC_1_1.copy()

import numpy as np
df.describe(include = [np.number])

| | id | sic4 | birth_year | diss_year | address_group | isdup_t_postcode |
|---|---|---|---|---|---|---|
| count | 254769.0 | 213596.0 | 254737.0 | 69005.0 | 259581.0 | 259581.0 |
| mean | 21215760.0 | 6399.244466 | 2000.244535 | 2014.631882 | 1.0 | 0.000593 |
| std | 38744940.0 | 2177.294293 | 15.828067 | 11.03944 | 0.0 | 0.16216 |
| min | 63.0 | 111.0 | 1856.0 | 8.0 | 1.0 | 0.0 |
| 25% | 1276526.0 | 4648.0 | 1995.0 | 2013.0 | 1.0 | 0.0 |
| 50% | 2615181.0 | 6810.0 | 2004.0 | 2015.0 | 1.0 | 0.0 |
| 75% | 9002524.0 | 8299.0 | 2011.0 | 2017.0 | 1.0 | 0.0 |
| max | 168733800.0 | 9999.0 | 2018.0 | 2018.0 | 1.0 | 78.0 |
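
The summary shows a minimum `diss_year` of 8.0, which cannot be a real year, so the year columns probably need a sanity check. The sketch below flags suspect rows on a small made-up sample; the helper name `flag_implausible_years` and the bounds `earliest=1800` and `latest=2018` are assumptions (2018 is simply the maximum seen above).

```python
import numpy as np
import pandas as pd


def flag_implausible_years(frame, earliest=1800, latest=2018):
    """Return a boolean mask of rows whose year fields look wrong."""
    birth, diss = frame["birth_year"], frame["diss_year"]
    # Comparisons with NaN are False, so missing years are not flagged.
    out_of_range = (
        (birth < earliest) | (birth > latest) | (diss < earliest) | (diss > latest)
    )
    dissolved_before_birth = diss < birth
    return out_of_range | dissolved_before_birth


# Made-up sample rows, not taken from the real data.
sample = pd.DataFrame({
    "birth_year": [1856.0, 2004.0, 2011.0, np.nan],
    "diss_year": [2014.0, 8.0, 2009.0, np.nan],
})
sample["suspect"] = flag_implausible_years(sample)
print(sample)
```

On the real data this would be `df[flag_implausible_years(df)]`, to inspect the flagged firms before deciding whether to drop or correct them.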


### Inspect the data types

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 259581 entries, 0 to 259580
Data columns (total 30 columns):
 #   Column                           Non-Null Count   Dtype   
---  ------                           --------------   -----   
 0   registered_number                259581 non-null  object  
 1   bvd_id                           259581 non-null  object  
 2   id                               254769 non-null  float64 
 3   registered_addresspostal_code    259581 non-null  object  
 4   primaryaddresspostcodeifuk       259581 non-null  object  
 5   alltradingaddressespostcodeifuk  259581 non-null  object  
 6   sic4                             213596 non-null  float64 
 7   birth_year                       254737 non-null  float64 
 8   diss_year                        69005 non-null   float64 
 9   streetnobuildingetcline1         259581 non-null  object  
 10  streetnobuildingetcline1native   259581 non-null  object  
 11  streetnobuildingetcline2         259581 non-null  ob

### Inspect the missing data

In [20]:
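# number of missing values per column, largest first
total = df.isnull().sum().sort_values(ascending=False)
# share of rows that are missing per column
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
# combine both into one summary table
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing_data)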

                                  Total   Percent
diss_year                        190576  0.734168
sic4                              45985  0.177151
birth_year                         4844  0.018661
id                                 4812  0.018538
_merge                                0  0.000000
isdup_t_postcode                      0  0.000000
bvd_id                                0  0.000000
registered_addresspostal_code         0  0.000000
primaryaddresspostcodeifuk            0  0.000000
alltradingaddressespostcodeifuk       0  0.000000
streetnobuildingetcline1              0  0.000000
streetnobuildingetcline1native        0  0.000000
streetnobuildingetcline2              0  0.000000
streetnobuildingetcline2native        0  0.000000
streetnobuildingetcline3              0  0.000000
streetnobuildingetcline3native        0  0.000000
streetnobuildingetcline4              0  0.000000
streetnobuildingetcline4native        0  0.000000
postcode                              0  0.000000
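
The text columns report zero missing values, but some of them (for example `bvd_id` and `primaryaddresspostcodeifuk`) contain empty strings, which `isnull()` does not count. A possible variant of the summary that also treats blank strings as missing is sketched below; `missing_summary` is a hypothetical helper and the sample rows are made up.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Count NaN and empty/whitespace-only strings as missing, per column."""
    # Casting to str turns NaN into "nan", so only truly blank cells match here.
    is_blank = frame.apply(lambda col: col.astype(str).str.strip() == "")
    missing = frame.isna() | is_blank
    summary = pd.DataFrame({"Total": missing.sum(), "Percent": missing.mean()})
    return summary.sort_values("Total", ascending=False)


# Made-up sample with the same column names as the real data.
sample = pd.DataFrame({
    "bvd_id": ["", "GB00002065", "GB00006480"],
    "primaryaddresspostcodeifuk": ["", "CB6 1RA", ""],
    "sic4": [4611.0, np.nan, 7499.0],
})
summary = missing_summary(sample)
print(summary)
```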


## Questions
- What is `sic4`?
- What is `diss_year`?
- Are the valuable attributes `city`, `country`, and something like `addresstype`?

### Inspect the unique values for all columns

In [21]:
# print the unique values of every column to inspect the data
for col in df.columns:
    print(f"{col}\n")
    print(df[col].unique())
    print("------------------------------------------------------------------------------------\n")

registered_number

['00000118' '00000258' '00000371' ... 'ZC000150' 'ZC000164' 'ZC000169']
------------------------------------------------------------------------------------

bvd_id

['' 'GB00002065' 'GB00006480' ... 'GBSO303717' 'GBZC000150' 'GBZC000164']
------------------------------------------------------------------------------------

id

[  233927. 12022629.  2111477. ...  2728817.   473940.  2591699.]
------------------------------------------------------------------------------------

registered_addresspostal_code

['TN23 1DA' 'RG12 1AN' 'SW1Y 6BN' ... 'G2 5AB' 'EH2 1DG' 'WC2B 5RR']
------------------------------------------------------------------------------------

primaryaddresspostcodeifuk

['' 'CB6 1RA' 'PL25 4BY' ... 'EH2 2ER' 'EH6 8QP' 'EH4 3LU']
------------------------------------------------------------------------------------

alltradingaddressespostcodeifuk

['' 'CB6 1RA, CB8 8QT, IP12 1PN, NR17 2QZ, NR21 9NH'
 'PL25 4BY, BA1 1SX, BA1 2AP, BA1 2JL, BL3 3QE' ... '
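
The `alltradingaddressespostcodeifuk` values above hold several postcodes in one comma-separated string. If one row per trading postcode is needed (for example to count sites per firm), a possible reshaping is sketched below on made-up rows; the new column name `trading_postcode` and the placeholder IDs are assumptions.

```python
import pandas as pd

# Made-up rows in the same shape as the real column.
sample = pd.DataFrame({
    "registered_number": ["A", "B", "C"],  # placeholder IDs
    "alltradingaddressespostcodeifuk": [
        "",
        "CB6 1RA, CB8 8QT, IP12 1PN",
        "PL25 4BY, BA1 1SX",
    ],
})

# Split each string into a list, then give every postcode its own row.
long = sample.assign(
    trading_postcode=sample["alltradingaddressespostcodeifuk"].str.split(",")
).explode("trading_postcode")

# Tidy whitespace and drop firms with no trading postcode listed.
long["trading_postcode"] = long["trading_postcode"].str.strip()
long = long[long["trading_postcode"] != ""].reset_index(drop=True)

# Number of distinct trading postcodes per firm.
n_sites = long.groupby("registered_number")["trading_postcode"].nunique()
print(n_sites)
```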

In [19]:
# export the CSV file
df.to_csv("/Users/fangzeqiang/Desktop/OC_1_1.csv")