
# Sorting and cleaning
---



To analyse a dataset effectively, we often need to prepare it first.
Before a dataset is ready to be analysed we might need to:  

* sort the data (can be a series or dataframe)  
* remove any rows with NaN/NA (missing) values   
* remove duplicate records (identical rows)  
* normalise data in dataframe columns so that it has a common scale [reference](https://towardsai.net/p/data-science/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff#:~:text=Similarly%2C%20the%20goal%20of%20normalization,dataset%20does%20not%20require%20normalization.&text=So%20we%20normalize%20the%20data,variables%20to%20the%20same%20range.)

## Sorting the data  
---


Typically we want to sort data by the values in one or more columns in the dataframe  

To sort the dataframe by the values in a column (a series), we use the pandas method **sort_values()**.  

By default `sort_values()` sorts into ascending order.

* sort by a single column e.g.
  * `df.sort_values("Make") `
* sort by multiple columns e.g.
  * `df.sort_values(by = ["Model", "Make"]) `
    * this sorts by Model, then by Make
* sort in *descending* order
  * `df.sort_values(by = "Make", ascending = False)`
  * `df.sort_values(by = ["Make", "Model"], ascending = False)`  

`sort_values()` does not change the original dataframe. It returns a new, sorted dataframe, so the sorted order is lost unless the result is displayed straight away or assigned to a variable.

`df.sort_values(by='Make')` *returns a sorted dataframe, which can be assigned to a new variable*  
`df` *original dataframe, df, will be as it was - unsorted*
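
The behaviour described above can be checked with a minimal sketch. The small car dataframe and the name `sorted_df` are made up for illustration and are not part of the worksheet data.

```python
import pandas as pd

# Hypothetical car data, used only for illustration
df = pd.DataFrame({
    "Make": ["Toyota", "Audi", "Ford"],
    "Model": ["Yaris", "A3", "Focus"],
})

# sort_values returns a new, sorted dataframe
sorted_df = df.sort_values(by="Make")
original_order = df["Make"].tolist()

print(sorted_df["Make"].tolist())  # ['Audi', 'Ford', 'Toyota']
print(original_order)              # ['Toyota', 'Audi', 'Ford'] (df is unchanged)

# To keep the sorted order under the same name, assign the result back
df = df.sort_values(by="Make")
```

Assigning the result back is usually the simplest way to keep a sorted dataframe.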

To select particular columns after sorting, do this in the same instruction, e.g.:

`df.sort_values(by = ["Make", "Model"], ascending = False)[["Make", "Model"]]`

This sorts on Make and then Model in descending order, then selects just the Make and Model columns.

`df.sort_values(by = ["Make", "Model"], ascending = False)[["Make", "Model"]].head()`

This sorts on Make and then Model in descending order, then selects the Make and Model columns and then takes the first 5 rows.
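
As a small sketch of chaining these steps, the example below sorts, selects columns and takes the first rows in one instruction. The dataframe is hypothetical, and the last part shows that `ascending` can also take one flag per column, which the worksheet examples do not use.

```python
import pandas as pd

# Hypothetical car data, used only for illustration
df = pd.DataFrame({
    "Make": ["Ford", "Audi", "Ford", "Audi"],
    "Model": ["Focus", "A3", "Fiesta", "A4"],
    "Year": [2018, 2020, 2019, 2017],
})

# Sort both columns descending, keep two columns, then take the first 2 rows
top = df.sort_values(by=["Make", "Model"], ascending=False)[["Make", "Model"]].head(2)
print(top)

# One flag per column: Make ascending, Model descending
mixed = df.sort_values(by=["Make", "Model"], ascending=[True, False])
print(mixed[["Make", "Model"]])
```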

### Exercise 1 - get data, sort by happiness score
---

Read data from the Excel file on Happiness Data at this link: https://github.com/futureCodersSE/working-with-data/blob/main/Happiness-Data/2015.xlsx?raw=true

Display first 5 rows of data  

The data is currently sorted by Happiness Rank...
*  sort the data by Happiness Score in ascending order
*  display sorted table

**Test output**:  
The lowest score (displayed first) is 2.839, Togo  
The highest score (displayed last) is 7.587, Switzerland  



In [None]:
import pandas as pd

In [None]:
url_happiness = 'https://github.com/futureCodersSE/working-with-data/blob/main/Happiness-Data/2015.xlsx?raw=true'
happinessdf = pd.read_excel(url_happiness)
happinessdf.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204
3,Norway,Western Europe,4,7.522,0.0388,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176


In [None]:
happinessdf.sort_values('Happiness Score')

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
157,Togo,Sub-Saharan Africa,158,2.839,0.06727,0.20868,0.13995,0.28443,0.36453,0.10731,0.16681,1.56726
156,Burundi,Sub-Saharan Africa,157,2.905,0.08658,0.01530,0.41587,0.22396,0.11850,0.10062,0.19727,1.83302
155,Syria,Middle East and Northern Africa,156,3.006,0.05015,0.66320,0.47489,0.72193,0.15684,0.18906,0.47179,0.32858
154,Benin,Sub-Saharan Africa,155,3.340,0.03656,0.28665,0.35386,0.31910,0.48450,0.08010,0.18260,1.63328
153,Rwanda,Sub-Saharan Africa,154,3.465,0.03464,0.22208,0.77370,0.42864,0.59201,0.55191,0.22628,0.67042
...,...,...,...,...,...,...,...,...,...,...,...,...
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176
3,Norway,Western Europe,4,7.522,0.03880,1.45900,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.43630,2.70201


### Exercise 2 - sort by multiple columns, display the first 5 rows
---

1. sort the data by Economy (GDP per Capita) and Health (Life Expectancy) in ascending order
2. display the first 5 rows of sorted data

**Test output**:  
Records 122, 127, 147, 100, 96

In [None]:
# Sorted by Health first, then Economy: the reverse of the order in the question, but the order that gives the test output above.
sorted_table = happinessdf.sort_values(['Health (Life Expectancy)', 'Economy (GDP per Capita)'])
sorted_table.head()
# The second column only decides the order of rows that tie on the first column.
# Both columns hold continuous values, so ties are rare and the second sort key has little visible effect.

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
122,Sierra Leone,Sub-Saharan Africa,123,4.507,0.07068,0.33024,0.95571,0.0,0.4084,0.08786,0.21488,2.51009
127,Botswana,Sub-Saharan Africa,128,4.332,0.04934,0.99355,1.10464,0.04776,0.49495,0.12474,0.10461,1.46181
147,Central African Republic,Sub-Saharan Africa,148,3.678,0.06112,0.0785,0.0,0.06699,0.48879,0.08289,0.23835,2.7223
100,Swaziland,Sub-Saharan Africa,101,4.867,0.08742,0.71206,1.07284,0.07566,0.30658,0.0306,0.18259,2.48676
96,Lesotho,Sub-Saharan Africa,97,4.898,0.09438,0.37545,1.04103,0.07612,0.31767,0.12504,0.16388,2.79832
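
To illustrate how the order of the sort columns matters, here is a minimal sketch on a made-up table with deliberate ties. The column names `Health` and `Economy` are shortened, hypothetical stand-ins for the real columns.

```python
import pandas as pd

# Hypothetical scores with repeated Health values so that ties occur
scores = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Health": [0.5, 0.2, 0.5, 0.2],
    "Economy": [1.0, 0.9, 0.3, 0.1],
})

# Health decides the order; Economy only breaks ties within equal Health
by_health_then_economy = scores.sort_values(["Health", "Economy"])
print(by_health_then_economy["Country"].tolist())  # ['D', 'B', 'C', 'A']

# Economy has no ties here, so Health never gets used
by_economy_then_health = scores.sort_values(["Economy", "Health"])
print(by_economy_then_health["Country"].tolist())  # ['D', 'C', 'B', 'A']
```

With continuous data such as the happiness scores, ties in the first column are rare, so the result is almost entirely determined by the first column listed.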


### Exercise 3 - sorting in descending order
---

Sort the data by Freedom and Trust (Government Corruption) in descending order and show the Country and Region only for the last five rows

**Test output**:
136, 117, 95, 101, 111 Country and Region columns


In [None]:
sorted_table = happinessdf.sort_values(['Freedom', 'Trust (Government Corruption)'], ascending = False)
sorted_table[['Country', 'Region']].tail()

Unnamed: 0,Country,Region
136,Angola,Sub-Saharan Africa
117,Sudan,Sub-Saharan Africa
95,Bosnia and Herzegovina,Central and Eastern Europe
101,Greece,Western Europe
111,Iraq,Middle East and Northern Africa


# Cleaning the data
---
Data comes from a range of sources:  forms, monitoring devices, etc.  There will often be missing values, duplicate records and values that are incorrectly formatted.  These can affect summary statistics and graphs plotted from the data.

Techniques for data cleansing include:
*  removing records with missing or null data (NaN, NA, "")
*  removing duplicate rows (keeping just one, either the first or the last)

Data can also be cleaned by removing rows that meet certain criteria, or by removing whole columns.  
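
Removing rows by criteria, or removing columns, might look like the sketch below. The dataframe, the `Price` threshold and the `Notes` column are hypothetical and chosen only to illustrate the two operations.

```python
import pandas as pd

# Hypothetical car data, used only for illustration
df = pd.DataFrame({
    "Make": ["Ford", "Audi", "Kia"],
    "Price": [9000, 21000, 12000],
    "Notes": ["", "", ""],
})

# Keep only the rows that meet a condition (boolean filtering)
affordable = df[df["Price"] < 15000]

# Remove a column that is not needed
trimmed = df.drop(columns=["Notes"])

print(affordable)
print(trimmed.columns.tolist())  # ['Make', 'Price']
```

Both operations return new dataframes and leave `df` itself unchanged.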


## Removing NaN/Dropping NA values
---

pandas has functions for checking a dataframe, or a single column, for null (missing) values, and for dropping rows that contain null values.

* check for NA/NaN/missing values across dataframe (returns True if NA values exist)  
  `df.isnull().values.any()`  

* check for NA/NaN/missing values in specific column  
  `df["Make"].isnull().values.any()`  

* drop all rows that have NA/NaN values   
  `df.dropna()`  

* drop rows where NA/NaN values exist in specific columns  
  `df.dropna(subset = ["Make", "Model"])`  
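
The checks listed above can be tried on a small made-up dataframe, as in this sketch. The data is hypothetical, and `isnull().sum()` is an extra call not listed above that counts the missing values in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values
df = pd.DataFrame({
    "Make": ["Ford", None, "Kia", "Audi"],
    "Model": ["Focus", "A3", None, "A4"],
    "Price": [9000, 21000, np.nan, np.nan],
})

any_missing = df.isnull().values.any()           # True: at least one NA somewhere
make_missing = df["Make"].isnull().values.any()  # True: the Make column has an NA
missing_per_column = df.isnull().sum()           # Make 1, Model 1, Price 2

complete_rows = df.dropna()                      # only rows with no NA at all
named_rows = df.dropna(subset=["Make", "Model"]) # NA in Price is tolerated

print(any_missing, make_missing)
print(missing_per_column)
print(len(complete_rows), len(named_rows))       # 1 2
```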

### Exercise 4 - check for null values
---

1. read data from the file housing_in_london_yearly_variables.csv from this link: https://raw.githubusercontent.com/futureCodersSE/working-with-data/main/Data%20sets/housing_in_london_yearly_variables.csv
2. check if any NA values exist in the dataframe and print the result
3. use df.info() to see which columns have null entries (*Hint: if the non-null count is less than total entries, column contains missing/NA entries*)  

**Test output**:
True
.info shows median_salary, life_satisfaction, recycling_pct, population_size, number_of_jobs, area_size, no_of_houses all less than total rows (1071)



In [None]:
url_housingLondon = 'https://raw.githubusercontent.com/futureCodersSE/working-with-data/main/Data%20sets/housing_in_london_yearly_variables.csv'
housingdf = pd.read_csv(url_housingLondon)
housingdf

Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,no_of_houses,borough_flag
0,E09000001,city of london,1999-12-01,33020.0,,48922,0,6581.0,,,,1
1,E09000002,barking and dagenham,1999-12-01,21480.0,,23620,3,162444.0,,,,1
2,E09000003,barnet,1999-12-01,19568.0,,23128,8,313469.0,,,,1
3,E09000004,bexley,1999-12-01,18621.0,,21386,18,217458.0,,,,1
4,E09000005,brent,1999-12-01,18532.0,,20911,6,260317.0,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...
1066,K03000001,great britain,2019-12-01,30446.0,,37603,,,,,,0
1067,K04000001,england and wales,2019-12-01,30500.0,,37865,,,,,,0
1068,N92000002,northern ireland,2019-12-01,27434.0,,32083,,,,,,0
1069,S92000003,scotland,2019-12-01,30000.0,,34916,,,,,,0


In [None]:
print('Any Null values?:', housingdf.isnull().values.any(), '\n')
housingdf.info()

Any Null values?: True 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1071 entries, 0 to 1070
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   code               1071 non-null   object 
 1   area               1071 non-null   object 
 2   date               1071 non-null   object 
 3   median_salary      1049 non-null   float64
 4   life_satisfaction  352 non-null    float64
 5   mean_salary        1071 non-null   object 
 6   recycling_pct      860 non-null    object 
 7   population_size    1018 non-null   float64
 8   number_of_jobs     931 non-null    float64
 9   area_size          666 non-null    float64
 10  no_of_houses       666 non-null    float64
 11  borough_flag       1071 non-null   int64  
dtypes: float64(6), int64(1), object(5)
memory usage: 100.5+ KB


### Exercise 5 - remove null values
---

1. remove rows with NA values for `life_satisfaction` (use [ ] even if only one column in list)
2. remove all NA values across whole dataframe

**Test output**:  
1.  Row count reduced to 352 rows
2.  Row count reduced to 267 rows

In [None]:
housingdf_nonulls = housingdf.dropna(subset = ["life_satisfaction"])
housingdf_nonulls['code'].count()

352

In [None]:
housingdf_nonulls = housingdf.dropna()
housingdf_nonulls['code'].count()

267

## Dropping duplicates
---

* To remove duplicate rows based on duplication of values in all columns  
  `df.drop_duplicates()`  

* To remove rows that have duplicate entries in a specified column  
  `df.drop_duplicates(subset = ['Make'])`  

* To remove rows that have duplicate entries in multiple columns  
  `df.drop_duplicates(subset = ['Make', 'Model'])`

* Remove duplicate rows keeping the last instance rather than the first (default):  
  `df.drop_duplicates(keep='last')`  
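
A minimal sketch of these options on a hypothetical dataframe, showing which rows survive in each case:

```python
import pandas as pd

# Hypothetical car data: rows 0 and 1 are identical
df = pd.DataFrame({
    "Make": ["Ford", "Ford", "Ford", "Audi"],
    "Model": ["Focus", "Focus", "Fiesta", "A3"],
    "Year": [2018, 2018, 2019, 2020],
})

all_columns = df.drop_duplicates()                          # drops row 1 only
by_make = df.drop_duplicates(subset=["Make"])               # first row per Make
by_make_last = df.drop_duplicates(subset=["Make"], keep="last")  # last row per Make

print(all_columns.index.tolist())   # [0, 2, 3]
print(by_make.index.tolist())       # [0, 3]
print(by_make_last.index.tolist())  # [2, 3]
```

The original index labels are kept, which makes it easy to see which instance of each duplicate was retained.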

### Exercise 6 - Removing duplicate entries
---

remove duplicate `area` entries keeping first instance  

**Test output**:  
 Dataframe now contains 50 rows all with date 1999-12-01

In [None]:
housingdf_nodupls = housingdf.drop_duplicates(subset = ['area'])
housingdf_nodupls

Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,no_of_houses,borough_flag
0,E09000001,city of london,1999-12-01,33020.0,,48922,0,6581.0,,,,1
1,E09000002,barking and dagenham,1999-12-01,21480.0,,23620,3,162444.0,,,,1
2,E09000003,barnet,1999-12-01,19568.0,,23128,8,313469.0,,,,1
3,E09000004,bexley,1999-12-01,18621.0,,21386,18,217458.0,,,,1
4,E09000005,brent,1999-12-01,18532.0,,20911,6,260317.0,,,,1
5,E09000006,bromley,1999-12-01,16720.0,,21293,13,294902.0,,,,1
6,E09000007,camden,1999-12-01,23677.0,,30249,13,190003.0,,,,1
7,E09000008,croydon,1999-12-01,19563.0,,22205,13,332066.0,,,,1
8,E09000009,ealing,1999-12-01,20580.0,,25046,12,302252.0,,,,1
9,E09000010,enfield,1999-12-01,19289.0,,21006,9,272731.0,,,,1


# Normalising Data  
When we normalise data, we remodel a numeric column in a dataframe to be on a standard scale (e.g. 0 or 1).   

For example if we had a column of BMI scores, we could normalise that column so that all scores greater than or equal to 25 were recoded to the value 1 (bad) and all scores less than 25 were recoded to 0 (good).  

To normalise we need to:
*   write a function that takes one row of the dataframe as its parameter (named `df` in the example below), looks at the value in the chosen column of that row and returns the matching value on the normalised scale (e.g. 0, 1 or 1, 2, 3, 4).

For example:  
```
def normalise_bmi(df):
  if df['bmi'] >= 25:
    return 1
  else:
    return 0

df["bmi"] = df.apply(normalise_bmi, axis=1)
```
This code reassigns the values in the column "bmi". Because `axis=1` is used, `apply` sends each row, one after the other, to the normalise_bmi function, which checks the value in the "bmi" column of that row and returns either 0 or 1.
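
The sketch below runs the same pattern on a small made-up `bmi` column. It is an illustration under assumptions rather than the worksheet's required method: the parameter is named `row` to show what `apply` passes in, the result goes into a hypothetical new column so the cell can be re-run safely, and the `numpy.where` line is an alternative that gives the same result without a row-by-row function.

```python
import numpy as np
import pandas as pd

# Hypothetical BMI scores, used only for illustration
df = pd.DataFrame({"bmi": [22.0, 25.0, 31.5, 18.4]})

def normalise_bmi(row):
    # row is one row of the dataframe (a Series) because axis=1 is used
    if row["bmi"] >= 25:
        return 1
    else:
        return 0

# Row-by-row version, written to a new column so the original values survive
df["bmi_flag"] = df.apply(normalise_bmi, axis=1)

# Alternative: compare the whole column at once
df["bmi_flag_vectorised"] = np.where(df["bmi"] >= 25, 1, 0)

print(df)
```

Writing to a new column avoids the problem of re-running the cell on values that have already been recoded.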

### Exercise 7 - normalise data set
---

Create a function called **normalise_income(df)** that will return the values 1, 2 or 3 to represent low income, middle income and high income.  If the value in `df['median_salary']` is less than 27441 (the median), return 1, otherwise if it is less than 30932 (the upper quartile) return 2 and otherwise return 3.

Apply the normalise_income(df) function to the `median_salary` column.

*NOTE:  this operation will change the original dataframe, so if you run it twice, everything in the median_salary column will change to 1 (as it had already been reduced to 1, 2 or 3). If this happens, run the code in Exercise 4 again to get the original data from the file.*

**Test output**:  
The maximum value of the column df['median_salary'] will be 3 and the minimum value will be 1  

In [None]:
housingdf['median_salary']

0       33020.0
1       21480.0
2       19568.0
3       18621.0
4       18532.0
         ...   
1066    30446.0
1067    30500.0
1068    27434.0
1069    30000.0
1070    27500.0
Name: median_salary, Length: 1071, dtype: float64

In [None]:
column_median = 27441
column_upperquart = 30932

def normalise_income(df):
  # df is a single row here, because apply() is called with axis=1
  if df['median_salary'] < column_median:
    return 1
  elif df['median_salary'] < column_upperquart:
    return 2
  else:
    # also reached when median_salary is NaN, as comparisons with NaN are False
    return 3

housingdf['median_salary'] = housingdf.apply(normalise_income, axis=1)
housingdf['median_salary']

0       3
1       1
2       1
3       1
4       1
       ..
1066    2
1067    2
1068    1
1069    2
1070    2
Name: median_salary, Length: 1071, dtype: int64

In [None]:
housingdf['median_salary'].min()

1

In [None]:
# First attempt. This does the comparison for each value, but it does not change the dataframe column:
# salary_value is a local variable holding a copy of each value, so reassigning it leaves the column untouched.

column_median = 27441
column_upperquart = 30932
def normalise_income(housingdf):
  for salary_value in housingdf['median_salary']:
    if salary_value < column_median:
      salary_value = 1
    elif salary_value < column_upperquart:
      salary_value = 2
    else:
      salary_value = 3
    print(salary_value)

normalise_income(housingdf)

#housingdf['median_salary']

In [None]:
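# This example from the W3 pandas tutorials (pandas dataframes section) might be relevant: it changes a dataframe in place by looping over the index labels.
# Note: df and the "Duration" column belong to that tutorial's data, so this cell does not run against housingdf as written.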

for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.drop(x, inplace = True)

### Exercise 8 - normalise the number of jobs column
---

Using what you have learnt from Exercise 7:  
*  use `df.describe()` to find the median, upper quartile and maximum for the number_of_jobs column  
*  create a function called **normalise_jobs(df)** that will return 1 if the `number_of_jobs` is below the median, 2 if the `number_of_jobs` is below the upper quartile or 3 otherwise.
*  normalise the `number_of_jobs` column by applying the function `normalise_jobs`.

**Test output**:  
The maximum value of the column df['number_of_jobs'] will be 3 and the minimum value will be 1  

In [None]:
housingdf['number_of_jobs'].describe()

count    9.310000e+02
mean     3.188095e+06
std      8.058302e+06
min      4.700000e+04
25%      9.450000e+04
50%      1.570000e+05
75%      2.217000e+06
max      3.575000e+07
Name: number_of_jobs, dtype: float64
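
Instead of copying the thresholds from the `describe()` output by hand, they can also be read programmatically. This is a minimal sketch on a hypothetical series; the variable names are made up, and `quantile` skips missing values in the same way `describe()` does.

```python
import pandas as pd

# Hypothetical job counts with one missing value
jobs = pd.Series([50.0, 100.0, 150.0, 200.0, None], name="number_of_jobs")

median = jobs.quantile(0.5)           # the "50%" row of describe()
upper_quartile = jobs.quantile(0.75)  # the "75%" row of describe()

print(median, upper_quartile)  # 125.0 162.5
```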

In [None]:
housingdf['number_of_jobs']

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
        ..
1066   NaN
1067   NaN
1068   NaN
1069   NaN
1070   NaN
Name: number_of_jobs, Length: 1071, dtype: float64

In [None]:
column_median = 1.57e+05
column_upperquart = 2.217e+06

def normalise_jobs(df):
  # df is a single row here, because apply() is called with axis=1
  if df['number_of_jobs'] < column_median:
    return 1
  elif df['number_of_jobs'] < column_upperquart:
    return 2
  else:
    # also reached when number_of_jobs is NaN, as comparisons with NaN are False
    return 3

housingdf['number_of_jobs'] = housingdf.apply(normalise_jobs, axis=1)
housingdf['number_of_jobs']

0       3
1       3
2       3
3       3
4       3
       ..
1066    3
1067    3
1068    3
1069    3
1070    3
Name: number_of_jobs, Length: 1071, dtype: int64

In [None]:
print('min:', housingdf['number_of_jobs'].min())
print('max:', housingdf['number_of_jobs'].max())

min: 1
max: 3


### Exercise 9 - normalise into a new column
---

Create a new function and code to normalise the `no_of_houses` column BUT this time, instead of assigning the result to `df['no_of_houses']` assign it to a new column called `df['housing_volume']`

**Test output**:  
The maximum value of the column df['housing_volume'] will be 3 and the minimum value will be 1

In [None]:
housingdf = pd.read_csv(url_housingLondon)

In [None]:
housingdf['no_of_houses'].describe()

count    6.660000e+02
mean     8.814682e+05
std      3.690376e+06
min      5.009000e+03
25%      8.763550e+04
50%      1.024020e+05
75%      1.262760e+05
max      2.417217e+07
Name: no_of_houses, dtype: float64

In [None]:
column_median = 1.02402e+05
column_upperquart = 1.26276e+05

def normalise_houses(df):
  # df is a single row here, because apply() is called with axis=1
  if df['no_of_houses'] < column_median:
    return 1
  elif df['no_of_houses'] < column_upperquart:
    return 2
  else:
    # also reached when no_of_houses is NaN, as comparisons with NaN are False
    return 3

# insert the new column at position 11, just after no_of_houses
housingdf.insert(11, 'housing_volume', housingdf.apply(normalise_houses, axis=1))
housingdf['housing_volume']


0       3
1       3
2       3
3       3
4       3
       ..
1066    3
1067    3
1068    3
1069    3
1070    3
Name: housing_volume, Length: 1071, dtype: int64

In [None]:
print('min:', housingdf['housing_volume'].min())
print('max:', housingdf['housing_volume'].max(), '\n')
housingdf.head()

min: 1
max: 3 



Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,no_of_houses,housing_volume,borough_flag
0,E09000001,city of london,1999-12-01,33020.0,,48922,0,6581.0,,,,3,1
1,E09000002,barking and dagenham,1999-12-01,21480.0,,23620,3,162444.0,,,,3,1
2,E09000003,barnet,1999-12-01,19568.0,,23128,8,313469.0,,,,3,1
3,E09000004,bexley,1999-12-01,18621.0,,21386,18,217458.0,,,,3,1
4,E09000005,brent,1999-12-01,18532.0,,20911,6,260317.0,,,,3,1


### Exercise 10 - normalise boroughs
---

Normalise the `area_size` column so that all values below mean are represented as 0 and otherwise are 1.  Assign the output to a new column called `area_size_normalised`.  

**Test output**:  
`area_size_normalised` column will contain both 0s and 1s.  The position of the first row with value 1 will be 0 and the position of the first row with value 0 will be 102.


In [None]:
housingdf['area_size'].describe()

count    6.660000e+02
mean     3.724903e+05
std      2.157060e+06
min      3.150000e+02
25%      2.960000e+03
50%      4.323000e+03
75%      8.220000e+03
max      1.330373e+07
Name: area_size, dtype: float64

In [None]:
# the exercise asks for the mean (about 3.72e+05 in the describe() output), not the median
column_mean = housingdf['area_size'].mean()

def normalise_area(df):
  # df is a single row here, because apply() is called with axis=1
  if df['area_size'] < column_mean:
    return 0
  else:
    # also reached when area_size is NaN, as comparisons with NaN are False
    return 1

# insert the new column at position 10, just after area_size
housingdf.insert(10, 'area_size_normalised', housingdf.apply(normalise_area, axis=1))
housingdf['area_size_normalised']


0       1
1       1
2       1
3       1
4       1
       ..
1066    1
1067    1
1068    1
1069    1
1070    1
Name: area_size_normalised, Length: 1071, dtype: int64

In [None]:
# column will contain both 0s and 1s
new_column_content = housingdf['area_size_normalised'].unique()
print('column contains:', new_column_content)
# The position of the first row with value 1 will be 0 and the position of the first row with value 0 will be 102.
print('first 1:', housingdf['area_size_normalised'].idxmax())
print('first 0:', housingdf['area_size_normalised'].idxmin(), '\n')
housingdf.head()

column contains: [1 0]
first 1: 0
first 0: 102 



Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,area_size_normalised,no_of_houses,housing_volume,borough_flag
0,E09000001,city of london,1999-12-01,33020.0,,48922,0,6581.0,,,1,,3,1
1,E09000002,barking and dagenham,1999-12-01,21480.0,,23620,3,162444.0,,,1,,3,1
2,E09000003,barnet,1999-12-01,19568.0,,23128,8,313469.0,,,1,,3,1
3,E09000004,bexley,1999-12-01,18621.0,,21386,18,217458.0,,,1,,3,1
4,E09000005,brent,1999-12-01,18532.0,,20911,6,260317.0,,,1,,3,1
