## Barcelona Population Data Exploration

Much of our analysis for this project needs population data to normalize statistics. In this notebook, I take the population-by-gender data for 2011-2019 and sum it to get a total population for each neighborhood (barri) in Barcelona. Surprisingly, we could not find a plain total-population dataset, but this is a good workaround.

There are no graphics here. Instead, I wrote a function and called it in a for loop to clean the data for each year. We also decided against folding this notebook into Casey's notebook for this week, because this file stands alone: it cleans the data and exports a usable dataframe that will provide context for other graphics.

### Import & Clean First Year's Data

In [1]:
#import pandas for data stuff
import pandas as pd
#import geopandas for spatial data stuff
import geopandas as gpd

In [2]:
#start with the beginning dataset 
pdf = gpd.read_file('data/2011_population by gender.csv')

The data has extra unneeded columns. 

In [3]:
#trim the df to only include the columns needed; this will delete the gender titles as well.
pdf_trimmed = pdf[['Codi_Barri',
 'Nom_Barri',
 'Nombre']]

I need to combine the two sexes for each barri. Before summing, the Nombre column has to be an integer.

In [4]:
#change datatype of Nombre to integer; assign returns a new dataframe, which avoids pandas' SettingWithCopyWarning
pdf_trimmed = pdf_trimmed.assign(Nombre=pdf_trimmed['Nombre'].astype(int))

In [5]:
#group by code barri and name barri, summing the numbers of the sexes. 
pdfa= pdf_trimmed.groupby(['Codi_Barri', 'Nom_Barri']).sum()[['Nombre']]

In [6]:
#reset the index. flatten
pdfb = pdfa.reset_index()
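
As a small illustration of the convert, group, sum, and flatten steps above, here is a sketch on a toy dataframe. The counts are made up, and the `Sexe` column and its values are my assumption about what the raw file's gender column looks like; only `Codi_Barri`, `Nom_Barri`, and `Nombre` come from the real data.

```python
import pandas as pd

# toy stand-in for one year's file: two rows (one per sex) for each barri
# NOTE: counts are invented; 'Sexe' and its values are assumed, not taken from the real file
toy = pd.DataFrame({
    'Codi_Barri': ['1', '1', '2', '2'],
    'Nom_Barri': ['el Raval', 'el Raval', 'el Barri Gòtic', 'el Barri Gòtic'],
    'Sexe': ['Dones', 'Homes', 'Dones', 'Homes'],
    'Nombre': ['100', '110', '50', '45'],  # counts stored as text
})

# convert the counts to integers so that sum() adds them as numbers
toy['Nombre'] = toy['Nombre'].astype(int)

# sum the two sexes per barri, then flatten the grouped index back into columns
toy_totals = toy.groupby(['Codi_Barri', 'Nom_Barri'])[['Nombre']].sum().reset_index()
print(toy_totals)
```

Each barri ends up with a single row, and `Nombre` holds the combined count for both sexes.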

Excellent, now I have a dataframe with the three columns needed: the barri code, the barri name, and the population. The population column still needs a new title: I want it to read "population" plus the year. I've set the year through a variable so that I can reuse the same pattern inside the function later.

In [7]:
#set variable for population year
x=2011

In [8]:
#change column names
pdfb.columns = ['c_barri', 'n_barri', 'population '+ str(x)]

### Import & Clean Subsequent Years' Data

For 2012-2019, I'd like to create a function for cleaning the data using the code above and making some tweaks to allow for a for loop. I'll first bring in the data to pass into a list, then create a function, and then loop the function over the list. 

In [9]:
pdf12 = gpd.read_file('data/2012_population by gender.csv')
pdf13 = gpd.read_file('data/2013_population by gender.csv')
pdf14 = gpd.read_file('data/2014_population by gender.csv')
pdf15 = gpd.read_file('data/2015_population by gender.csv')
pdf16 = gpd.read_file('data/2016_population by gender.csv')
pdf17 = gpd.read_file('data/2017_population by gender.csv')
pdf18 = gpd.read_file('data/2018_population by gender.csv')
pdf19 = gpd.read_file('data/2019_population by gender.csv')

In [10]:
#pass each df into a list 
years = [pdf12, pdf13, pdf14, pdf15, pdf16, pdf17, pdf18, pdf19]

Now I have a list and need to create the function. 

The function first stores the year, taken from the first column of the dataframe, in a variable before the dataframe is trimmed. After trimming and summing the genders, it uses that year to build the column name. At the end, it returns a dataframe with the barri code and population only; the neighborhood name is left out because it is already in the first dataframe.

In [11]:
#create function for cleaning each dataframe
def clean(df):
    #grab year from 1st column
    x = df.iloc[0, 0]
    #trim the df; .copy() avoids pandas' SettingWithCopyWarning on the assignment below
    pdf_trimmed = df[['Codi_Barri', 'Nombre']].copy()
    #change number to integer for the sum in groupby
    pdf_trimmed['Nombre'] = pdf_trimmed['Nombre'].astype(int)
    #sum the sexes for each barri
    pdfa = pdf_trimmed.groupby(['Codi_Barri']).sum()[['Nombre']]
    #reset the index. flatten
    pdfb = pdfa.reset_index()
    #rename column titles
    pdfb.columns = ['c_barri', 'population ' + str(x)]
    #show me the $$!!
    return pdfb

In [12]:
#create empty list for the new dfs
new_years = []
#LOOP
for df in years:
    new_years.append(clean(df))

This mostly worked. The dataframes within new_years don't have names, but I can index the list to get each one: new_years[0] is 2012, and so on through new_years[7] for 2019.
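
Since the list itself carries no labels, one option would be to pair each cleaned dataframe with its year in a dict. This is only a sketch: `toy_new_years` is a made-up stand-in for the real cleaned frames, and `by_year` is a name I've invented here.

```python
import pandas as pd

# the years covered by the list, in the same order the frames were appended
year_labels = range(2012, 2020)

# toy stand-ins for the eight cleaned frames (values are invented)
toy_new_years = [
    pd.DataFrame({'c_barri': ['1', '2'], f'population {y}': [100 + i, 50 + i]})
    for i, y in enumerate(year_labels)
]

# pair each frame with its year so it can be looked up by name instead of position
by_year = dict(zip(year_labels, toy_new_years))

print(by_year[2015].columns.tolist())
```

With this, `by_year[2015]` replaces having to remember that 2015 sits at index 3.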

### Merge Data

In [13]:
#merge each df from new_years onto pdfb, one year at a time
final = pdfb
for year_df in new_years:
    final = final.merge(year_df, on='c_barri')
#take a peek at the final product
final.head()

| | c_barri | n_barri | population 2011 | population 2012 | population 2013 | population 2014 | population 2015 | population 2016 | population 2017 | population 2018 | population 2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | el Raval | 47700 | 49027 | 48800 | 47489 | 47142 | 47129 | 47608 | 46948 | 47353 |
| 1 | 10 | Sant Antoni | 38002 | 38372 | 38260 | 38096 | 38033 | 38184 | 38345 | 38090 | 38236 |
| 2 | 11 | el Poble Sec | 40547 | 41258 | 40984 | 40278 | 40208 | 40055 | 40228 | 39520 | 39995 |
| 3 | 12 | la Marina del Prat Vermell | 1124 | 1153 | 1172 | 1144 | 1125 | 1143 | 1149 | 1158 | 1196 |
| 4 | 13 | la Marina de Port | 30243 | 30458 | 30271 | 30047 | 30232 | 30385 | 30584 | 30660 | 30958 |
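
Because `merge` defaults to an inner join, any barri missing from one of the yearly files would silently drop out of the result. A quick check along these lines could catch that. This is a sketch on toy frames with invented values, not something run against the real data; `validate` and `indicator` are standard `merge` parameters.

```python
import pandas as pd

# toy frames: barri '3' is deliberately missing from the second year (values are invented)
left = pd.DataFrame({'c_barri': ['1', '2', '3'], 'population 2011': [100, 50, 75]})
right = pd.DataFrame({'c_barri': ['1', '2'], 'population 2012': [101, 52]})

# validate='one_to_one' raises an error if a barri code is duplicated on either side
inner = left.merge(right, on='c_barri', validate='one_to_one')

# an outer merge with indicator=True adds a '_merge' column marking unmatched rows
check = left.merge(right, on='c_barri', how='outer', indicator=True)
missing = check.loc[check['_merge'] != 'both', 'c_barri'].tolist()

print(len(left), len(inner), missing)
```

If `missing` comes back empty and the row count matches the first dataframe, no neighborhoods were lost in the merge.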


### Export Dataframe 

I want to have this as a standalone new file to use later. 

In [14]:
#export the merged dataframe as csv to data folder
final.to_csv(r'data/2011_2019_BarcelonaPop_Barri.csv')
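
One thing to keep in mind when another notebook reloads this file: `to_csv` writes the index by default, so reading the file back yields an extra `Unnamed: 0` column. The sketch below shows the difference on a two-row toy frame, using an in-memory buffer in place of a real file. Whether to pass `index=False`, and whether the barri codes should stay text, are choices I'm assuming here rather than anything the project has settled.

```python
import io
import pandas as pd

# two-row toy version of the final dataframe
final_toy = pd.DataFrame({'c_barri': ['1', '10'], 'population 2011': [47700, 38002]})

# default export: the index is written as an unnamed first column
buf_with_index = io.StringIO()
final_toy.to_csv(buf_with_index)
buf_with_index.seek(0)
with_index = pd.read_csv(buf_with_index)

# export without the index, reading the barri codes back as text
buf_no_index = io.StringIO()
final_toy.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
without_index = pd.read_csv(buf_no_index, dtype={'c_barri': str})

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```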