# Common Cancer Deaths in Men (1990 - 2016)

### Table of Contents  

- <a href = "#summary">Summary</a>  
- <a href = "#wrangling">Data Wrangling </a> 
- <a href = "#">Exploratory Data Analysis </a>  
- <a href = "#conclusion">Conclusions </a>   


<h2 id = "summary"> Introduction</h2>

There are five common types of cancer in men: lung, stomach, liver, prostate, and colorectal (colon and rectum) cancer. In 2016, more than **1 million** men died of one of these cancers worldwide.  
Which country is affected the most by these cancers? Which cancer kills the most? How has the number of deaths evolved since 1990?  
In this project we analyse data from [Gapminder World](www.gapminder.org) on these five types
of cancer in men and try to answer these questions.



In [1]:
# Import all the packages needed for the project
import glob                        # Finds files matching a pattern
import pandas as pd                # Dataframes
import numpy as np                 # Arrays
import matplotlib.pyplot as plt    # Plotting functions (used as plt in the plots below)
import matplotlib.style as style   # Plot styles
import folium                      # World map
style.use("fivethirtyeight")       # We use this style for the plots
%matplotlib inline

<h2 id ="wrangling">Data Wrangling </h2>  
First, we import the separate files and gather them into a single dataset.  

### General Properties

In [48]:
## Load all the files
pattern = "files/*deaths.csv"      # Pattern matching all the files whose name ends with "deaths"
all_files = glob.glob(pattern)     # List of the paths matching the pattern
files_names = ["col_rec_df", "stomach_df", "liver_df", "prostate_df", "lung_df"] # Name given to each file
all_df = {}                        # Dictionary that stores the names and dataframes
assert len(all_files) == len(files_names) # Checks that there are as many files as names
# Caution: glob.glob returns paths in no guaranteed order, so check that each name matches its file
for name, path in zip(files_names, all_files):
    all_df[name] = pd.read_csv(path) # Loads each file and stores it in the dict under its name
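
Since `glob.glob` gives no guarantee about the order of the paths it returns, pairing files and names by position can silently mislabel a dataframe. Below is a minimal sketch of one possible alternative, where each key is derived from the file name itself. The function `load_by_filename` and the file names (`lung_cancer_deaths.csv`, `liver_cancer_deaths.csv`) are hypothetical, and the demo files are written to a temporary folder with made-up values.

```python
import glob
import os
import tempfile

import pandas as pd


def load_by_filename(pattern):
    """Load every CSV matching `pattern`, keyed by its file name without the extension."""
    frames = {}
    for path in sorted(glob.glob(pattern)):   # sorted() makes the order reproducible
        key = os.path.splitext(os.path.basename(path))[0]
        frames[key] = pd.read_csv(path)
    return frames


# Demo with two tiny files in a temporary folder (hypothetical names, invented values)
with tempfile.TemporaryDirectory() as folder:
    for stem in ["lung_cancer_deaths", "liver_cancer_deaths"]:
        pd.DataFrame({"country": ["A", "B"], "1990": [1.0, 2.0]}).to_csv(
            os.path.join(folder, stem + ".csv"), index=False
        )
    demo = load_by_filename(os.path.join(folder, "*deaths.csv"))

print(list(demo))
```

With this approach the dictionary key always comes from the file it was read from, so the `assert` on the number of files is no longer the only safeguard.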


Now that all the files are loaded with a unique name, let us take a look at the structure of each file.

In [49]:
# Checks the shape of every dataframe
for name, df in all_df.items():
    print(name, df.shape)

col_rec_df (187, 28)
stomach_df (187, 28)
liver_df (187, 28)
prostate_df (187, 28)
lung_df (187, 28)


All five dataframes have the same shape: **187** observations and **28** variables.

In [44]:
for name,df in all_df.items():
    print(name, df.head(1)) 

col_rec_df        country   1990   1991   1992   1993   1994   1995   1996   1997   1998  \
0  Afghanistan  153.0  165.0  181.0  200.0  219.0  237.0  249.0  258.0  265.0   

   ...   2007   2008   2009   2010   2011   2012   2013   2014   2015   2016  
0  ...  403.0  418.0  434.0  452.0  470.0  492.0  516.0  543.0  571.0  599.0  

[1 rows x 28 columns]
stomach_df        country   1990   1991   1992   1993   1994   1995   1996   1997   1998  \
0  Afghanistan  482.0  519.0  568.0  630.0  698.0  758.0  797.0  826.0  850.0   

   ...    2007    2008    2009    2010    2011    2012    2013    2014  \
0  ...  1180.0  1200.0  1210.0  1230.0  1250.0  1280.0  1310.0  1350.0   

     2015    2016  
0  1400.0  1450.0  

[1 rows x 28 columns]
liver_df        country   1990   1991   1992   1993   1994   1995   1996   1997   1998  \
0  Afghanistan  124.0  134.0  146.0  161.0  176.0  191.0  199.0  206.0  211.0   

   ...   2007   2008   2009   2010   2011   2012   2013   2014   2015   2016  
0  ...  

All files are structured in the same way: a country column followed by one column per year from 1990 to 2016. This structure is not optimal for analysis, so let us reshape it from wide to long.
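
As a toy illustration of what a wide-to-long reshape does, here is a small sketch using `DataFrame.melt` on two hypothetical countries with invented values (not the real data):

```python
import pandas as pd

# Invented values for two hypothetical countries, in the same wide layout as the source files
wide = pd.DataFrame({
    "country": ["A", "B"],
    "1990": [10.0, 20.0],
    "1991": [11.0, 21.0],
})

# melt keeps `country` as an identifier and stacks the year columns into rows
long = wide.melt(id_vars=["country"], var_name="year", value_name="deaths")
print(long)
```

Each (country, year) pair becomes one row, which makes grouping and filtering by year straightforward.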

In [68]:
# Transform the files from wide to long
def gather(df: pd.DataFrame, value_name: str) -> pd.DataFrame:
    """Reshape a wide dataframe to long format and name the value column."""
    # melt reshapes and renames the columns in one step
    return df.melt(id_vars = ["country"], var_name = "year", value_name = value_name + "deaths")

dfs = []
for name, df in all_df.items():
    dfs.append(gather(df, name[:-2]))  # "lung_df"[:-2] -> "lung_", giving the column "lung_deaths"
# Keep each long dataframe under its own name for the next steps (same order as files_names)
col_rec_long, stomach_long, liver_long, prostate_long, lung_long = dfs
pd.concat(dfs, join = "inner", axis = 1)

Unnamed: 0,country,year,col_rec_deaths,country.1,year.1,stomach_deaths,country.2,year.2,liver_deaths,country.3,year.3,prostate_deaths,country.4,year.4,lung_deaths
0,Afghanistan,1990,153.00,Afghanistan,1990,482.00,Afghanistan,1990,124.00,Afghanistan,1990,130.00,Afghanistan,1990,577.00
1,Albania,1990,74.20,Albania,1990,207.00,Albania,1990,149.00,Albania,1990,149.00,Albania,1990,471.00
2,Algeria,1990,260.00,Algeria,1990,650.00,Algeria,1990,215.00,Algeria,1990,261.00,Algeria,1990,1150.00
3,Andorra,1990,6.72,Andorra,1990,4.02,Andorra,1990,1.25,Andorra,1990,7.16,Andorra,1990,13.70
4,Angola,1990,149.00,Angola,1990,283.00,Angola,1990,302.00,Angola,1990,233.00,Angola,1990,268.00
5,Antigua and Barbuda,1990,2.49,Antigua and Barbuda,1990,5.95,Antigua and Barbuda,1990,2.27,Antigua and Barbuda,1990,12.50,Antigua and Barbuda,1990,3.51
6,Argentina,1990,2910.00,Argentina,1990,3450.00,Argentina,1990,845.00,Argentina,1990,2690.00,Argentina,1990,7020.00
7,Armenia,1990,141.00,Armenia,1990,326.00,Armenia,1990,129.00,Armenia,1990,53.80,Armenia,1990,716.00
8,Australia,1990,2270.00,Australia,1990,909.00,Australia,1990,270.00,Australia,1990,2050.00,Australia,1990,4580.00
9,Austria,1990,1220.00,Austria,1990,990.00,Austria,1990,350.00,Austria,1990,890.00,Austria,1990,2490.00


In [None]:
# Check the files to see if all is correct 
prostate_long.info()

Now that our files are OK, let us put them all together in a single dataframe.

In [None]:
cancers_df = liver_long.copy()  # Create a new dataframe from a copy of liver_long
cancers_df["lung_deaths"] = lung_long.lung_deaths  # Add the lung_deaths column from lung_long
cancers_df["stomach_deaths"] = stomach_long.stomach_deaths    # Add the stomach_deaths column from stomach_long
cancers_df["prostate_deaths"] = prostate_long.prostate_deaths # Add the prostate_deaths column from prostate_long
cancers_df.head()  # Check the first few rows
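
Assigning columns from one dataframe to another relies on every dataframe having its rows in the same order. A hedged alternative sketch is to merge on the `country` and `year` keys so that rows are matched explicitly. The toy frames and values below are invented for illustration only:

```python
from functools import reduce

import pandas as pd

# Toy long-format frames (invented values); note the different row order in the second one
liver = pd.DataFrame({"country": ["A", "B"], "year": ["1990", "1990"], "liver_deaths": [1.0, 2.0]})
lung = pd.DataFrame({"country": ["B", "A"], "year": ["1990", "1990"], "lung_deaths": [20.0, 10.0]})

# Merge pairwise on the shared keys; validate raises an error if a key pair is duplicated
combined = reduce(
    lambda left, right: left.merge(right, on=["country", "year"], how="inner", validate="one_to_one"),
    [liver, lung],
)
print(combined)
```

The merge is slower than plain column assignment, but it does not depend on row order and it fails loudly when a (country, year) pair appears twice.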

In [None]:
# Check the info of the final data
cancers_df.info()
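
`info()` reports the non-null count of each column. A more direct way to count missing values could look like the sketch below, shown here on a toy frame with one deliberately missing value (invented data):

```python
import pandas as pd

# Toy frame with one deliberately missing value (invented data)
toy = pd.DataFrame({"country": ["A", "B"], "lung_deaths": [10.0, None]})

missing_per_column = toy.isna().sum()   # number of missing values in each column
print(missing_per_column)
print("any missing:", bool(toy.isna().any().any()))
```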

The info of the final dataframe shows that there are no missing values and that the data types are correct:
we have clean data. Let's look at some statistics to get an overview.


In [None]:
cancers_df.describe()

In [None]:
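# Select the row with the highest number of lung cancer deaths and plot its deaths per cancer type
cancers_max_plt = cancers_df[cancers_df.lung_deaths == cancers_df.lung_deaths.max()].plot(kind = "bar", x = "year", figsize = (10,8))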
plt.xlabel("Year", fontsize = 14)
plt.ylabel("Number of deaths", fontsize = 14)
plt.legend(fontsize = 18)
plt.title("China, Cancer deaths in 2016", fontsize = 16);

In 2016, **424000** men died of lung cancer in China alone, which is almost **12** times the number of prostate cancer deaths and close to the entire population of the city of Minneapolis (USA).

In [None]:
# Main plot: group cancers_df by year, then sum to get the total number of deaths per cancer type
cancers_trends = cancers_df.groupby("year").sum(numeric_only = True).plot(legend = False, color = ["red", "brown", "blue", "green"], figsize = (10,8))
# Adds a title 
cancers_trends.text(x = 1, y = 1350000, s = "Cancer Death Trends in Men", fontsize = 18, weight = "bold")
# Adds a subtitle
cancers_trends.text(x = 1, y = 1250000, s = "Evolution of the number of deaths from 1990 to 2016\ndue to the 4 most common cancers in men",
                    fontsize = 16, alpha = 0.7)
# Adds the label of each legend on the line
cancers_trends.text(x=10, y = 1000000, s = "lung cancer", fontsize = 14, rotation = 20, color = "brown")
cancers_trends.text(x=5, y = 520000, s = "stomach cancer",fontsize = 14, color = "blue")
cancers_trends.text(x=1, y = 400000, s = "liver cancer", fontsize = 14, rotation = 14, color = "red")
cancers_trends.text(x=12, y = 320000, s = "prostate cancer", fontsize = 14, rotation = 6, color = "green")
# Adds a signature bar
cancers_trends.text(x = 1, y = 50000, s = "(c)Furawa                                                                                                Source: Gapminder World",
                   fontsize = 12, backgroundcolor = "grey", color = "white");

Lung cancer kills the most people year after year. In 2016, more than 1 million men died of lung cancer
worldwide. Except for stomach cancer deaths (which are quite stable), deaths from all the other types of cancer
increase each year. 
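
To put a number on such trends, one could compute the relative change between the first and last year for each cancer type. The sketch below uses invented yearly totals for two hypothetical cancer types; with the real data, the totals would come from `cancers_df.groupby("year").sum(numeric_only=True)`.

```python
import pandas as pd

# Invented yearly totals for two hypothetical cancer types
totals = pd.DataFrame(
    {"type_a_deaths": [100.0, 150.0], "type_b_deaths": [200.0, 190.0]},
    index=["1990", "2016"],
)

# Relative change between the first and last year, in percent
change_pct = (totals.loc["2016"] - totals.loc["1990"]) / totals.loc["1990"] * 100
print(change_pct.round(1))
```

A value close to zero would correspond to a stable trend, while a large positive value indicates a strong increase over the period.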

In [None]:
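world = pd.read_csv("files/worldcities.csv")   # Load the world cities file
world_df = world.iloc[:,2:5].copy()            # Keep columns 2 to 4 (used below as lat, lng and country)
world_df = world_df.drop_duplicates(subset = "country", keep = "first")  # Keep one row per country
world_df = world_df.sort_values(by = "country")                          # Sort by country name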

In [None]:
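cancers_2016 = cancers_df[cancers_df.year == "2016"].copy()  # Keep the 2016 rows; copy to avoid modifying a view
# Sum only the numeric *_deaths columns to get the total per country
cancers_2016["total_deaths"] = cancers_2016.filter(like = "_deaths").sum(axis = 1)
cancers_2016.head()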

In [None]:
world_map = world_df.merge(cancers_2016, how = "inner", on = "country")
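
An inner merge silently drops every country whose name is spelled differently in the two tables. A minimal sketch of how such mismatches could be listed, using an outer merge with `indicator=True`; the two toy tables below are invented stand-ins for the coordinates and 2016 tables:

```python
import pandas as pd

# Toy stand-ins for the coordinates and 2016 tables (invented names and values)
coords = pd.DataFrame({"country": ["A", "B"], "lat": [0.0, 1.0], "lng": [0.0, 1.0]})
deaths = pd.DataFrame({"country": ["A", "C"], "total_deaths": [5.0, 7.0]})

# An outer merge with indicator=True labels each row by the table(s) it came from
check = coords.merge(deaths, how="outer", on="country", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["country", "_merge"]]
print(unmatched)
```

Rows marked `left_only` or `right_only` are the ones that would be missing from the map.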

In [None]:
# Note: the "Mapbox Bright" tiles may not be available in recent folium versions; "OpenStreetMap" is a built-in alternative
cancer_map = folium.Map(location = [20,0], tiles = "Mapbox Bright", zoom_start = 3)

# Draw one circle per country, with a radius proportional to its total deaths in 2016
for _, row in world_map.iterrows():
    folium.Circle(
        location = [row["lat"], row["lng"]],
        popup = row["country"],
        radius = float(row["total_deaths"]) * 2,
        color = "purple",
        fill = True,
        fill_color = "purple"
    ).add_to(cancer_map)
cancer_map

<h2 id = "conclusion"> Conclusions </h2>