In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Number of chargers analysis

In [2]:
charger = pd.read_csv('data_charger_energy_EVs.csv')
print(charger.head)


# Drop rows with NaN values
charger = charger.dropna()

#Calculate the z-scores for the 'number_of_charger' column:
charger['z_score'] = np.abs(stats.zscore(charger['number_of_charger']))


#Define a threshold to identify outliers
threshold = 3

#Identify the outliers by filtering the rows with z-scores greater than the threshold:
outliers_charger = charger[charger['z_score'] > threshold]


# Extract the PC6 values of the outliers
pc6_values = outliers_charger['PC6'].tolist()
print(pc6_values)

#put the identified outliers in number of chargers and its corresponding postcode in a new dataframe

outliers_charger_view = pd.DataFrame({'PC6': outliers_charger['PC6'], 'Number_of_chargers': outliers_charger['number_of_charger']})

outliers_charger_view.head(100)


<bound method NDFrame.head of           PC6  number_of_charger  total_capacity  electricity_household  \
0      1003AE                1.0            22.0                    NaN   
1      1011AB                2.0            22.0                    0.0   
2      1011AC                1.0            11.0                 2880.0   
3      1011BG                1.0            22.0                    0.0   
4      1011CA                1.0            11.0                 1850.0   
...       ...                ...             ...                    ...   
19006  1114BC                NaN             NaN                    0.0   
19007  1114BD                NaN             NaN                    0.0   
19008  1114BE                NaN             NaN                    0.0   
19009  1114BG                NaN             NaN                    0.0   
19010  1114BH                NaN             NaN                    0.0   

       electricity_company  December 2017_EV  December 2018_EV  \
0  

Unnamed: 0,PC6,Number_of_chargers
44,1012TL,3.0
59,1013GM,3.0
77,1013NJ,3.0
105,1014AK,5.0
114,1014ZC,3.0
...,...,...
2544,1102AZ,3.0
2577,1103AC,3.0
2603,1103TV,4.0
2606,1104EA,4.0


Possible reasoning for outliers in number of chargers:

- Most of the detected outliers has the number of chargers 3 or 4. However it is also shown that the postcode 1059CM has the highest number of chargers (16 chargers). The area 1095MD also has relatively more chargers (6 chargers).
- 1059CM area is located in the west of Amsterdam. The neighborhood and nearby area consists of business centers, commerical centers, events venues, and recreational facilities, such as Club Atelier or Wicked Grounds. This may explain why this area has more number of chargers. 
- 1095MD area is located in the east of Amsterdam. The neighborhood is an new and more modern area on an island, surrounded by waters, harbors, and canals. This is a newly-developed area with increasing potential of residency; this may help explain why this area has more number of chargers. 



# Electricity (household/company) analysis

Electricity (household):

In [3]:
from scipy import stats

# Calculate the z-scores
charger['z_score_household'] = stats.zscore(charger['electricity_household'])
charger['z_score_company'] = stats.zscore(charger['electricity_company'])

threshold = 3

outliers_household = charger[charger['z_score_household'] > threshold]
outliers_company = charger[charger['z_score_company'] > threshold]
outliers_household.head(10)
outliers_company.head(10)

# Extract the PC6 values of the outliers for electricity_household
pc6_values_household = outliers_household['PC6'].tolist()
print(pc6_values_household)

# Extract the PC6 values of the outliers for electricity_company
pc6_values_company = outliers_company['PC6'].tolist()
print(pc6_values_company)

#put the identified outliers  and its corresponding postcode in a new dataframe

outliers_household_view = pd.DataFrame({'PC6': outliers_household['PC6'], 'electricity_household': outliers_household['electricity_household']})

outliers_household_view.head(100)



['1017KX', '1017XK', '1077AA', '1077BM', '1077GH', '1077GN', '1077GS', '1077GX', '1077LH', '1077RV', '1077VX', '1077WJ', '1081BP', '1087MR']
['1012SZ', '1013BC', '1014AJ', '1014AK', '1014AS', '1014BA', '1018TV', '1031HH', '1042AH', '1043DV', '1059CM', '1067SM', '1072NV', '1076ED', '1077XZ', '1081AP', '1082GC', '1082KR', '1083HJ', '1083HP', '1091GC', '1092AD', '1096CJ', '1097AR', '1097DM', '1101AR', '1105AS']


Unnamed: 0,PC6,electricity_household
242,1017KX,5070.0
263,1017XK,5220.0
1835,1077AA,5190.0
1839,1077BM,5190.0
1860,1077GH,5180.0
1861,1077GN,6140.0
1863,1077GS,4950.0
1864,1077GX,5300.0
1876,1077LH,4910.0
1890,1077RV,5200.0


Possible reasoning for outliers (electricity - household):
- From the outlier analysis, the area 1077 has higher electricity (household). This area is a residential area with lots of residential properties, departments, and green structures such as parks or playgrounds.  There are also highschools and educational properties located in this area. Overall, this is a area suitable for households and famililes to live in. This may help explain why the electricity for household is relatively higher than that in other areas.





Electricity (company):

In [6]:
threshold = 3


outliers_company = charger[charger['z_score_company'] > threshold]

outliers_company.head(10)



# Extract the PC6 values of the outliers for electricity_company
pc6_values_company = outliers_company['PC6'].tolist()
print(pc6_values_company)

#put the identified outliers  and its corresponding postcode in a new dataframe

outliers_company_view = pd.DataFrame({'PC6': outliers_company['PC6'], 'electricity_company': outliers_company['electricity_company']})

outliers_company_view.head(100)

['1012SZ', '1013BC', '1014AJ', '1014AK', '1014AS', '1014BA', '1018TV', '1031HH', '1042AH', '1043DV', '1059CM', '1067SM', '1072NV', '1076ED', '1077XZ', '1081AP', '1082GC', '1082KR', '1083HJ', '1083HP', '1091GC', '1092AD', '1096CJ', '1097AR', '1097DM', '1101AR', '1105AS']


Unnamed: 0,PC6,electricity_company
43,1012SZ,392900.0
53,1013BC,160790.0
104,1014AJ,154750.0
105,1014AK,232070.0
108,1014AS,194350.0
112,1014BA,1167350.0
325,1018TV,222710.0
513,1031HH,234540.0
658,1042AH,947350.0
665,1043DV,264420.0


Possible reasoning for outliers (electricity - company):
- From the outlier analysis, it is shown that the area 1101AR, 1014BA, and 1042AH, for example, has higher electricity (company) are identified as outliers based on above codes.
- The postcode 1101AR area is located in the southeast part of the Amsterdam city. It is an area full of car retailers, warehouses, logistic centers, and car companies. Therefore it is reasonable to assume that it is the reason why this area has higher electricity (company).
- The 1014BA neighborhood is located to the west of the Amsterdam city center, and is nearby to the westport of Amsterdam. Therefore, this area known for its facilities and activities with regard to ports, transportation, and industrial activities. This may help explain why the electricity (company) is relatively higher.
- The area 1042AH is also primarily an industrial area filled with facilities such as warehouses, retail centers, or distribution centers. Therefore it is reasonable to assume that it is the reason why this area has higher electricity (company).

# Household analysis

In [12]:
# Load the data into a pandas DataFrame with correct data types
household = pd.read_csv('data_household.csv', dtype={'Postcode': str, '2017.0': int, '2018.0': int, '2019.0': int})
household.head()

# Calculate the z-scores for each data point
z_scores = (household.iloc[:, 1:] - household.iloc[:, 1:].mean()) / household.iloc[:, 1:].std()

# Set the threshold for outlier detection
threshold = 3

# Detect outliers using the z-scores
outliers = np.where(np.abs(z_scores) > threshold)

# Get the indices of the outliers
outlier_indices = list(zip(outliers[0], outliers[1]))

# Create a new DataFrame to store the outliers
outliers_df = pd.DataFrame(columns=household.columns)

# Iterate over each outlier index and retrieve the corresponding row
for index in outlier_indices:
    outlier_row = household.iloc[index[0]]
    outliers_df = outliers_df.append(outlier_row)

# Reset the index of the outliers DataFrame
outliers_df = outliers_df.reset_index(drop=True)

# Print the DataFrame with the detected outliers
print(outliers_df)



Empty DataFrame
Columns: [Postcode, 2017.0, 2018.0, 2019.0]
Index: []


There are no detected outliers in terms of household in our analysis.

# Population analysis

In [17]:
import pandas as pd

# Load the data
new_age = pd.read_csv('data_population_age.csv', header=[0, 1])

# Flatten the multi-index header
new_age.columns = [' '.join(col).strip() for col in new_age.columns.values]

# Rename the first column to 'Postcode'
new_age.rename(columns={new_age.columns[0]: 'Postcode'}, inplace=True)

# Convert columns to numeric
for col in new_age.columns[1:]:
 new_age[col] = pd.to_numeric(new_age[col], errors='coerce')

# Drop rows with NaN values
new_age = new_age.dropna()
## the rows with NaN values are dropped because they are apparently industrial areas with mostly warehouses, office spaces & businesess




# Calculate Z-scores
columns_to_check_newage = new_age.columns[1:]

z_scores = np.abs(stats.zscore(new_age[columns_to_check_newage]))

# Define a threshold
threshold = 3
outliers_newage = np.where(z_scores > threshold)


# Print outliers
print(f"Outliers are at index positions: {outliers_newage}")





# Create a DataFrame for outliers
outliers_df1 = pd.DataFrame(columns=new_age.columns)
for i in range(len(outliers_newage[0])):
    row_index = outliers_newage[0][i]
    col_index = outliers_newage[1][i]
    outliers_df1 = pd.concat([outliers_df1, new_age.iloc[[row_index], :]])

print(outliers_df1)

Outliers are at index positions: (array([49, 49, 49, 49, 49, 49, 63, 63, 63, 63, 63]), array([ 0,  6, 12, 18, 24, 30,  7, 13, 19, 25, 31]))
   Postcode  2017 0_to_10  2017 10_to_20  2017 20_to_30  2017 30_to_40  \
50     1069        3830.0         2945.0         4610.0         4300.0   
50     1069        3830.0         2945.0         4610.0         4300.0   
50     1069        3830.0         2945.0         4610.0         4300.0   
50     1069        3830.0         2945.0         4610.0         4300.0   
50     1069        3830.0         2945.0         4610.0         4300.0   
50     1069        3830.0         2945.0         4610.0         4300.0   
64     1087        3470.0         2965.0         1680.0         3020.0   
64     1087        3470.0         2965.0         1680.0         3020.0   
64     1087        3470.0         2965.0         1680.0         3020.0   
64     1087        3470.0         2965.0         1680.0         3020.0   
64     1087        3470.0         2965.0      

Possible reasoning for outliers (population):
- According to the outlier analysis, areas such as 1069 and 1087 are identified as outliers in terms of they have relatively higher population (when compared to other areas).
- Area 1069 is a neighborhood located in Amsterdam Nieuw-West, and is an area that provides various residential and recreational spaces. For instance, there are a lot of housing options, supermarkets, shopping centers, parks, restaurants, and cafes located in the area. Moreover, it is very well-connected to the other areas in Amsterdam in terms of public transportation. Hence, this is an area suitable for residence and this may help explain why the population in this area are detected as outliers in out analysis.
- Area 1087 is located in the east part of Amsterdam city. This is a newly-developed and a modern neighborhood, well-known for its location on the island of Ijmeer lake. This area offers lots of schools, recreational venues, hotels, restaurants and cafes, and shopping centers. This may help explain why the population in this area are detected as outliers in out analysis.