# Breakdown of this notebook:
1. **Loading the dataset:** Load the data and import the libraries.
1. **Data Cleaning:**
     * Deleting redundant columns.
     * Renaming the columns.
     * Dropping duplicates.
     * Cleaning individual columns.
1. **Data Visualization:** Necessary Visualization plots
1. **Do State Capitals have higher HDI?**
1. **HDI Distribution**
1. **Which cities have highest GDP per Capita?**
1. **Is there a Population Growth trend with respect to GDP per Capita?**
     * **Does this signify that the people found it logical to move to cities with high GDP per Capita?**
1.  **Do cities with large area and high GDP per Capita tend to have high Population Growth % ?**
1.  **Do cities with more companies show more population growth?**
1.  **Is Car Distribution localised to specific parts of Brazil?**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt # used to draw the plots
from matplotlib import rcParams
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
data=pd.read_csv("../input/BRAZIL_CITIES.csv", sep=";", decimal=",")
data.head()

In [None]:
data.shape

In [None]:
data.UBER = data.UBER.replace(np.nan,0)
data.POST_OFFICES = data.POST_OFFICES.replace(np.nan,1)
data.CAPITAL = data.CAPITAL.replace(0,'NO')
data.CAPITAL = data.CAPITAL.replace(1,'YES')

In [None]:
original_data = data.copy(True)

In [None]:
columns = ['CITY', 'STATE', 'CAPITAL', 'IBGE_RES_POP','AREA',
           'IDHM','LONG','LAT','ALT','ESTIMATED_POP','GDP','GDP_CAPITA','COMP_TOT',
           'Cars','Motorcycles','UBER','Wheeled_tractor','POST_OFFICES']
df = data[columns].copy()  # work on a copy so later edits do not act on a view of data

In [None]:
df.head()

In [None]:
print("Percentage of null values per column in df")
# isnull() is an alias of isna(), so one call is enough
(df.isna().sum() * 100 / len(df)).round(2)

In [None]:
# Reassign instead of dropping in place, which avoids a SettingWithCopyWarning
df = df.dropna(how='any')
print("Percentage of null values per column in df")
(df.isna().sum() * 100 / len(df)).round(2)

In [None]:
df.rename(columns={'IBGE_RES_POP': 'Population(2010)', 
                    'IDHM':'Human Development Index Ranking',
                    'ESTIMATED_POP':'Estimated Population(2018)',
                    'COMP_TOT':'Total companies',
                    'UBER':'Uber',
                    'POST_OFFICES':'Post Offices'}, inplace=True)

**Let's create a new column that gives better understanding of population growth from 2010 to 2018**

In [None]:
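# Note: this keeps one row per city name, so same-named cities in different states collapse into one.
df = df.drop_duplicates(subset='CITY', keep='first').copy()
# Percentage change from the 2010 population to the 2018 estimate
df['Population Growth %'] = (df['Estimated Population(2018)'] - df['Population(2010)']) / df['Population(2010)'] * 100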
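
As a quick sanity check of the growth formula, here is a minimal sketch on a toy dataframe. The three rows are made-up values for illustration only and do not come from the dataset.

```python
import pandas as pd

# Toy values, made up for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "CITY": ["A", "B", "C"],
    "Population(2010)": [1000, 2000, 500],
    "Estimated Population(2018)": [1100, 1800, 500],
})

# Percentage change relative to the 2010 population.
toy["Population Growth %"] = (
    (toy["Estimated Population(2018)"] - toy["Population(2010)"])
    / toy["Population(2010)"] * 100
)
print(toy)
```

A city that grows from 1000 to 1100 gets +10, one that shrinks from 2000 to 1800 gets -10, and an unchanged city gets 0.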

**It always becomes easier to find relationships between columns if we plot a correlation plot first.**

In [None]:
correlation = df.corr(numeric_only=True)  # skip text columns such as CITY and STATE
sns.heatmap(correlation)

# Do State Capitals have higher HDI?

In [None]:
capital_hdi=sns.violinplot(x = 'CAPITAL', y = 'Human Development Index Ranking', data = df, palette = "Set3")
capital_hdi.set_xlabel(xlabel = 'Capital', fontsize = 9)
capital_hdi.set_ylabel(ylabel = 'Human Development Index Ranking', fontsize = 9)
capital_hdi.set_title(label = 'Capitals vs HDI', fontsize = 20)
plt.show()

**State capitals do have a higher Human Development Index (HDI)** (*mostly greater than 0.7*), as is evident from the plot above!
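
The violin plot can be backed up with a few summary numbers. The sketch below assumes the column names used in this kernel; the helper `hdi_summary_by_capital` and the toy rows are hypothetical and only illustrate the idea.

```python
import pandas as pd

def hdi_summary_by_capital(frame, hdi_col="Human Development Index Ranking"):
    """Return count, median and minimum HDI for capitals and non-capitals."""
    return frame.groupby("CAPITAL")[hdi_col].agg(["count", "median", "min"])

# Toy rows, made up for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "CAPITAL": ["YES", "YES", "NO", "NO", "NO"],
    "Human Development Index Ranking": [0.80, 0.76, 0.60, 0.70, 0.65],
})
summary = hdi_summary_by_capital(toy)
print(summary)
```

Calling `hdi_summary_by_capital(df)` on the real dataframe would show whether the minimum HDI among capitals really sits above 0.7.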

## HDI Distribution

In [None]:
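# df['Human Development Index Ranking'].mean()
fig, ax = plt.subplots(figsize=[16,4])
# sns.distplot is deprecated; histplot with kde=True gives a comparable histogram with a density curve
category_plot = sns.histplot(df['Human Development Index Ranking'], kde=True, stat='density', ax=ax)
ax.set_title('HDI Distribution for all cities')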


**The Human Development Index for most cities lies between 0.5 and 0.75, as the graph above shows.**
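
To put a number on "most cities", one option is to compute the share of values inside the range. This is a minimal sketch: `share_between` is a hypothetical helper and the five values are made up for illustration.

```python
import pandas as pd

def share_between(values, low, high):
    """Fraction of non-null values with low <= value <= high."""
    values = values.dropna()
    return values.between(low, high).mean()

# Toy HDI values, made up for illustration only.
toy_hdi = pd.Series([0.45, 0.55, 0.62, 0.70, 0.81])
share = share_between(toy_hdi, 0.5, 0.75)
print(f"{share:.0%} of the toy cities fall between 0.5 and 0.75")
```

On the real data the call would be `share_between(df['Human Development Index Ranking'], 0.5, 0.75)`.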

## Necessary Visualization plots
Before we jump into the analysis and look for relationships between the columns of this dataframe, it helps to draw the following plots, which make the rest of this kernel easier to follow.
These plots focus on:
1. Area of the cities
1. Total Companies in every city
1. Estimated Population Growth % per city

In [None]:
cmap = sns.cubehelix_palette(dark=.1, light=.3, as_cmap=True)

# Filter once per plot so that x, y, hue and size all come from the same rows.
large_area = df[df['AREA'] >= 1500000]
f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(data=large_area, x='LONG', y='LAT',
                palette=cmap, hue='AREA', size='AREA', ax=ax)
plt.title("Location of Cities with large Area")


many_companies = df[df['Total companies'] > 1000]
f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(data=many_companies, x='LONG', y='LAT',
                palette=cmap, hue='Total companies', size='Total companies', ax=ax)
plt.title("Location of Cities with large number of Companies")


high_growth = df[df['Population Growth %'] > 15]
f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(data=high_growth, x='LONG', y='LAT',
                palette=cmap, hue='Population Growth %', size='Population Growth %', ax=ax)
plt.title("Location of Cities with large Population Growth %")

* The **Southeast** of Brazil shows a large number of cities with **high estimated population growth %**.
* The **East and Southeast** of Brazil also have a **large number of companies**.
* The largest cities in Brazil are in the **Northwest**, but a large number of huge cities are also situated in the **East** of Brazil.

So, it will be interesting to see the relationships between the columns visualized above. *Let's do that!*

# Which cities have highest GDP per Capita?

The criterion here is a GDP per Capita of at least 80k.

In [None]:
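# df['GDP_CAPITA'].max()
# Keep the cities whose GDP per Capita is at least 80k, sorted in ascending order.
# A boolean mask replaces the groupby/filter detour, which selects the same rows.
newdf = df.loc[df['GDP_CAPITA'] >= 80000, ['CITY','GDP_CAPITA','CAPITAL','STATE']]
newdf = newdf.sort_values(by=['GDP_CAPITA'])
newdf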

In [None]:
plt.rcParams['figure.figsize'] = (15, 9)
newdf.STATE.value_counts().sort_index().plot.bar()

* **78 cities** in Brazil have a GDP per Capita greater than 80k. **We will use this dataframe of 78 cities for the analysis below!**
* Most cities in this list belong to the **State of São Paulo**.
* Interestingly, all the cities in this list are **non-capitals**.
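
The three observations above can each be checked with one line of pandas. The sketch below uses a toy dataframe with made-up values; only the column names follow this kernel.

```python
import pandas as pd

# Toy rows, made up for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "CITY": ["A", "B", "C", "D"],
    "STATE": ["SP", "SP", "RJ", "MT"],
    "CAPITAL": ["NO", "NO", "YES", "NO"],
    "GDP_CAPITA": [95000.0, 82000.0, 40000.0, 120000.0],
})

high_gdp = toy[toy["GDP_CAPITA"] >= 80000]

n_cities = len(high_gdp)                                # how many cities pass the threshold
top_state = high_gdp["STATE"].value_counts().idxmax()   # state with the most such cities
has_capital = (high_gdp["CAPITAL"] == "YES").any()      # is any of them a state capital?
print(n_cities, top_state, has_capital)
```

Running the same three lines on `newdf` would reproduce the city count, the leading state, and the capital check.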

# Is there a Population Growth trend with respect to GDP per Capita?

In [None]:
# df['Population Growth %'].max()
# Cities whose estimated population fell by 10% or more.
newdf_pop = df.loc[df['Population Growth %'] <= -10, ['CITY','Population Growth %']]
newdf_pop = newdf_pop.sort_values(by=['Population Growth %'])

# Cities whose estimated population grew by 10% or more.
newdf_pop_grow = df.loc[df['Population Growth %'] >= 10, ['CITY','Population Growth %']]
newdf_pop_grow = newdf_pop_grow.sort_values(by=['Population Growth %'])

In [None]:
pop_decrease = pd.merge(newdf, newdf_pop, how='inner', on=['CITY'])
print('Cities with Estimated Population decrease with high GDP_CAPITA \n')
pop_decrease

In [None]:
pop_increase=pd.merge(newdf, newdf_pop_grow, how='inner', on=['CITY'])
print('Cities with Estimated Population increase with high GDP_CAPITA \n')
pop_increase

**There are 50 cities in Brazil with high GDP per Capita and high estimated Population Growth %.**

In [None]:
plt.rcParams['figure.figsize'] = (15, 9)
pop_increase.STATE.value_counts().sort_index().plot.bar()
plt.title('States with high Population growth and high GDP per capita',size=20)

## Does this signify that the people found it logical to move to cities with high GDP per Capita?

* 50 out of 78 cities with **high GDP per Capita (greater than 80k)** showed **high estimated population growth (more than 10%)** from **2010 to 2018**. *I think it makes sense!*
* **Piratuba** and **Rosana** are the obvious exceptions: their estimated populations decreased by **17%** and **13%** respectively from **2010 to 2018**, even though both cities had a high GDP per Capita!
* The state of **São Paulo** clearly has the **highest** share of cities in this analysis, with **21 cities** that combine high GDP per Capita and high estimated population growth %!

# Do cities with large area and high GDP per Capita tend to have high Population Growth % ?

In [None]:
# df.AREA.mean()
# Cities with an area of at least 1254967, sorted in ascending order.
newdf_area = df.loc[df['AREA'] >= 1254967, ['CITY','AREA']]
newdf_area = newdf_area.sort_values(by=['AREA'])

In [None]:
pop_area=pd.merge(pop_increase, newdf_area, how='inner', on=['CITY'])
pop_area

* The large cities of the state of **Mato Grosso (MT)** have a surprisingly high estimated population growth % !!
* The average population growth % in the cities of **Mato Grosso** is greater than **25%** !

# Do cities with more companies show more population growth?

In [None]:
# Cities with at least 1000 companies, sorted in ascending order.
newdf_companies = df.loc[df['Total companies'] >= 1000, ['CITY','Total companies']]
newdf_companies = newdf_companies.sort_values(by=['Total companies'])
newdf_companies

1. As you can see above, only **713 out of 5576** total cities in our dataframe have **more than 1000 companies**!
1. As you can see below, **345 out of these 713** cities (almost 50%) show more than a **10% increase** in population from 2010 to 2018.

*These numbers suggest that cities with a large number of companies tend to see a large increase in population. One likely reason is that big cities offer more job opportunities, which attract a lot of people.*
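
The 345 out of 713 figure is easier to interpret next to the same share for cities with fewer companies. The sketch below shows one way to compare the two groups. The helper `growth_share_by_company_group` is hypothetical and the toy rows are made up; the thresholds of 1000 companies and 10% growth follow this kernel.

```python
import pandas as pd

def growth_share_by_company_group(frame, company_threshold=1000, growth_threshold=10):
    """Share of cities growing by at least growth_threshold %, split by company count."""
    group = (frame["Total companies"] >= company_threshold).map(
        {True: "many companies", False: "few companies"}
    )
    grew = (frame["Population Growth %"] >= growth_threshold).astype(float)
    return grew.groupby(group).mean()

# Toy rows, made up for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "Total companies": [1500, 2500, 300, 800, 50, 4000],
    "Population Growth %": [12.0, 4.0, 11.0, 2.0, -3.0, 15.0],
})
shares = growth_share_by_company_group(toy)
print(shares)
```

If the share for "many companies" on the real data is clearly higher than the share for "few companies", that supports the reading above; if the two are similar, the number of companies alone explains little.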

In [None]:
pop_companies=pd.merge(newdf_pop_grow, newdf_companies, how='inner', on=['CITY'])
pop_companies

# Is Car Distribution localised to specific parts of Brazil?

Let's create a new column **'Cars distribution'** to get a better understanding of the **number of cars per person** in every city of Brazil.

**To filter the cities, I have assumed a threshold of one car for every 5 people on average (a ratio of at least 0.20).**

In [None]:

df['Cars distribution'] = df['Cars'] / df['Estimated Population(2018)']
# df.head()
# df['Cars distribution'].describe()
# Keep cities with at least one car per five people, so x, y, hue and size share the same rows.
many_cars = df[df['Cars distribution'] >= 0.20]
f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(data=many_cars, x='LONG', y='LAT',
                palette=cmap, hue='Cars distribution', size='Cars distribution', ax=ax)

In [None]:
correlation = df.corr(numeric_only=True)  # skip text columns such as CITY and STATE
sns.heatmap(correlation)

* The **South and Southeast** of Brazil have the **highest car-to-person ratio**.
* This **matches** the **plot of cities with a large number of companies** in the **'Necessary Visualization plots'** part of this kernel, which suggests that people living in cities with many companies use more cars to commute than people in other cities.
* Also, it is interesting to see that our newly created column **'Cars distribution' correlates highly with Latitude and Longitude**!
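
The heatmap shows the correlation with latitude and longitude only as colours. To read the actual coefficients, the relevant column of the correlation matrix can be printed directly. This is a minimal sketch on made-up coordinates and ratios; only the column names follow this kernel.

```python
import pandas as pd

# Toy rows, made up for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "LAT": [-30.0, -25.0, -20.0, -10.0, -3.0],
    "LONG": [-51.0, -49.0, -47.0, -40.0, -60.0],
    "Cars distribution": [0.40, 0.35, 0.30, 0.12, 0.08],
})

# Pearson correlation of the cars-per-person ratio with each coordinate.
corr_with_coords = toy.corr()["Cars distribution"].drop("Cars distribution")
print(corr_with_coords.round(2))
```

In the toy data the ratio falls as latitude rises (moving north), so the coefficient for `LAT` is strongly negative. On the real data, `df[['LAT', 'LONG', 'Cars distribution']].corr()` gives the corresponding values.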
