# Common Cancer Deaths in Men (1990 - 2016)

### Table of Contents  

- <a href = "#summary">Introduction</a>  
- <a href = "#wrangling">Data Wrangling </a> 
- <a href = "#">Exploratory Data Analysis </a>  
- <a href = "#conclusion">Conclusions </a>   


<h2 id = "summary"> Introduction</h2>

There are five common types of cancer in men: lung, stomach, liver, prostate, and colon and rectum cancer. In 2016, more than **1 million** men died of one of these cancers worldwide.  
Which country is most affected by these cancers? Which cancer kills the most? How has the number of deaths evolved since 1990?  
In this project we analyse data taken from [Gapminder World](https://www.gapminder.org) about these five types
of cancer in men and try to answer these questions.



In [1]:
# Import all the packages needed for the project
import glob                        # Finds the data files by pattern
import pandas as pd                # Dataframes
import numpy as np                 # Arrays
import matplotlib.pyplot as plt    # Plotting interface used for labels and titles
import matplotlib.style as style   # Plot styles
import folium                      # World map
style.use("fivethirtyeight")       # We use this style for the plots
%matplotlib inline

<h2 id ="wrangling">Data Wrangling </h2>  
First of all, we import the separate files and combine them into a single dataset.  

### General Properties

In [2]:
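## Load all the files
pattern = "files/*deaths.csv"
all_files = glob.glob(pattern)
# NOTE: glob.glob returns paths in arbitrary order, so the names below must be
# listed in the same order as all_files for each dataframe to get the right label
files_names = ["col_rec_df", "stomach_df", "liver_df", "prostate_df", "lung_df"]
assert len(all_files) == len(files_names)
# Read each file and store the dataframe under its name
all_df = {name: pd.read_csv(path) for name, path in zip(files_names, all_files)}

all_df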

{'col_rec_df':                             country      1990      1991      1992      1993  \
 0                       Afghanistan    153.00    165.00    181.00    200.00   
 1                           Albania     74.20     75.70     76.20     77.30   
 2                           Algeria    260.00    274.00    306.00    342.00   
 3                           Andorra      6.72      7.31      7.86      8.13   
 4                            Angola    149.00    150.00    152.00    156.00   
 5               Antigua and Barbuda      2.49      2.49      2.49      2.48   
 6                         Argentina   2910.00   2950.00   3000.00   3050.00   
 7                           Armenia    141.00    150.00    159.00    164.00   
 8                         Australia   2270.00   2310.00   2360.00   2390.00   
 9                           Austria   1220.00   1230.00   1240.00   1250.00   
 10                       Azerbaijan    183.00    187.00    206.00    208.00   
 11                       
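Pairing names and paths by position works only when both lists happen to be in the same order. As a hedged alternative, the sketch below pairs each dataframe name with the one path whose file name contains a keyword. The helper `match_files` and the example file names are hypothetical, since the actual names under `files/` are not shown in this notebook.

```python
from pathlib import Path

def match_files(paths, keywords):
    """Map each dataframe name to the single path whose file name contains its keyword."""
    matched = {}
    for name, keyword in keywords.items():
        hits = [p for p in paths if keyword in Path(p).name]
        if len(hits) != 1:
            # Fail loudly instead of silently mislabelling a dataset
            raise ValueError(f"expected exactly one file for {name!r}, found {len(hits)}")
        matched[name] = hits[0]
    return matched

# Hypothetical file names, for illustration only
example_paths = ["files/lung_cancer_deaths.csv", "files/liver_cancer_deaths.csv"]
example_keywords = {"lung_df": "lung", "liver_df": "liver"}
example_match = match_files(example_paths, example_keywords)
print(example_match)
```

The resulting mapping could then be passed to `pd.read_csv` path by path, independent of the order in which `glob.glob` returns the files.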

In [5]:
all_df["lung_df"].head()

|  | country | 1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | ... | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 577.0 | 621.0 | 678.0 | 748.0 | 822.0 | 888.0 | 927.0 | 955.0 | 979.0 | ... | 1390.0 | 1430.0 | 1460.0 | 1510.0 | 1560.0 | 1620.0 | 1690.0 | 1760.0 | 1840.0 | 1930.0 |
| 1 | Albania | 471.0 | 478.0 | 477.0 | 478.0 | 482.0 | 497.0 | 510.0 | 533.0 | 559.0 | ... | 699.0 | 716.0 | 731.0 | 736.0 | 741.0 | 750.0 | 763.0 | 781.0 | 801.0 | 818.0 |
| 2 | Algeria | 1150.0 | 1240.0 | 1220.0 | 1230.0 | 1240.0 | 1270.0 | 1200.0 | 1150.0 | 1110.0 | ... | 1260.0 | 1280.0 | 1300.0 | 1320.0 | 1360.0 | 1410.0 | 1460.0 | 1510.0 | 1580.0 | 1640.0 |
| 3 | Andorra | 13.7 | 14.7 | 15.7 | 16.1 | 16.1 | 15.8 | 15.8 | 15.8 | 15.9 | ... | 21.0 | 21.4 | 21.9 | 22.2 | 22.7 | 23.3 | 23.7 | 24.3 | 24.8 | 25.0 |
| 4 | Angola | 268.0 | 268.0 | 270.0 | 274.0 | 281.0 | 285.0 | 280.0 | 277.0 | 288.0 | ... | 351.0 | 367.0 | 382.0 | 397.0 | 416.0 | 436.0 | 458.0 | 482.0 | 507.0 | 532.0 |


In [None]:
all_df["prostate_df"].head(2)

In [None]:
all_df["stomach_df"].head(2)

In [None]:
# Transform the files from wide to long
def gather(df: pd.DataFrame, value_name: str) -> pd.DataFrame:
    """Reshape a wide dataframe (one column per year) into long format.

    Returns one row per (country, year) pair, with the values stored in a
    column named `value_name`.
    """
    return df.melt(id_vars = ["country"], var_name = "year", value_name = value_name)

# Use the gather function to transform each file (the dataframes live in all_df)
liver_long = gather(all_df["liver_df"], "liver_deaths")          # Transform liver_df to liver_long
lung_long = gather(all_df["lung_df"], "lung_deaths")             # Transform lung_df to lung_long
stomach_long = gather(all_df["stomach_df"], "stomach_deaths")    # Transform stomach_df to stomach_long
prostate_long = gather(all_df["prostate_df"], "prostate_deaths") # Transform prostate_df to prostate_long
liver_long.head() # Check the first 5 rows
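To make the wide-to-long reshape concrete, here is a minimal sketch on a made-up two-country frame. The `toy_wide` data and the `deaths` column name are illustrative assumptions and are not taken from the Gapminder files.

```python
import pandas as pd

# Hypothetical wide frame: one row per country, one column per year
toy_wide = pd.DataFrame({
    "country": ["A", "B"],
    "1990": [10.0, 20.0],
    "1991": [11.0, 21.0],
})

# melt keeps "country" as the identifier and stacks the year columns into rows
toy_long = toy_wide.melt(id_vars=["country"], var_name="year", value_name="deaths")
print(toy_long)
```

Note that the `year` column holds the former column labels, so its values are strings such as `"1990"` rather than integers.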

In [None]:
# Check the files to see if all is correct 
prostate_long.info()

Now that our files are OK, let's put them all together in a single dataframe.

In [None]:
cancers_df = liver_long.copy()  # Copy liver_long so that adding columns does not modify it
# The assignments below align on the row index, so they rely on every long
# dataframe listing the same countries and years in the same order
cancers_df["lung_deaths"] = lung_long.lung_deaths              # Add the lung_deaths column from lung_long
cancers_df["stomach_deaths"] = stomach_long.stomach_deaths     # Add the stomach_deaths column from stomach_long
cancers_df["prostate_deaths"] = prostate_long.prostate_deaths  # Add the prostate_deaths column from prostate_long
cancers_df.head()  # Check the first few rows
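The column-by-column assignment above is correct only if every long dataframe lists its rows in the same order. A hedged alternative is to merge on the `country` and `year` keys, which does not depend on row order. The sketch below uses two small made-up frames standing in for `liver_long` and `lung_long`; the values are illustrative.

```python
from functools import reduce

import pandas as pd

# Hypothetical long frames; note that the rows are deliberately in a different order
liver = pd.DataFrame({"country": ["A", "B"], "year": ["1990", "1990"],
                      "liver_deaths": [1.0, 2.0]})
lung = pd.DataFrame({"country": ["B", "A"], "year": ["1990", "1990"],
                     "lung_deaths": [20.0, 10.0]})

# Merge the frames pairwise on the shared keys instead of relying on row position
frames = [liver, lung]
combined = reduce(
    lambda left, right: left.merge(right, on=["country", "year"], how="inner"),
    frames,
)
print(combined)
```

With `how="inner"`, a country or year missing from any one frame is dropped from the result, so comparing the row counts before and after the merge is a quick sanity check.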

In [None]:
# Check the info of the final data
cancers_df.info()

Checking the info of the final dataframe, we can see that there are no missing values and the data types are correct,
so the data is clean. Let's look at some summary statistics to get an overview.


In [None]:
cancers_df.describe()

In [None]:
# Select the row with the highest number of lung cancer deaths and plot all of its cancer columns
cancers_max_plt = cancers_df[cancers_df.lung_deaths == cancers_df.lung_deaths.max()].plot(kind = "bar", x = "year", figsize = (10,8))
plt.xlabel("Year", fontsize = 14)
plt.ylabel("Number of deaths", fontsize = 14)
plt.legend(fontsize = 18)
plt.title("China, Cancer deaths in 2016", fontsize = 16);

In 2016, **424000** men died of lung cancer in China alone, which is almost **12** times the number of prostate cancer deaths and roughly the entire population of the city of Minneapolis (USA).

In [None]:
# Main plot: group cancers_df by year, then sum to get the total number of deaths per cancer type
# (numeric_only = True keeps the text column "country" out of the sum)
cancers_trends = cancers_df.groupby("year").sum(numeric_only = True).plot(legend = False, color = ["red", "brown", "blue", "green"], figsize = (10,8))
# Adds a title 
cancers_trends.text(x = 1, y = 1350000, s = "Cancer Death Trends in Men", fontsize = 18, weight = "bold")
# Adds a subtitle
cancers_trends.text(x = 1, y = 1250000, s = "Evolution of the number of deaths from 1990 to 2016\ndue to the 4 most common cancers in men",
                    fontsize = 16, alpha = 0.7)
# Adds the label of each legend on the line
cancers_trends.text(x=10, y = 1000000, s = "lung cancer", fontsize = 14, rotation = 20, color = "brown")
cancers_trends.text(x=5, y = 520000, s = "stomach cancer",fontsize = 14, color = "blue")
cancers_trends.text(x=1, y = 400000, s = "liver cancer", fontsize = 14, rotation = 14, color = "red")
cancers_trends.text(x=12, y = 320000, s = "prostate cancer", fontsize = 14, rotation = 6, color = "green")
# Adds a signature bar
cancers_trends.text(x = 1, y = 50000, s = "(c)Furawa                                                                                                Source: Gapminder World",
                   fontsize = 12, backgroundcolor = "grey", color = "white");

Lung cancer is the one that kills the most people over the years. In 2016, more than 1 million men died of lung cancer
worldwide. Except for stomach cancer deaths (which are fairly stable), deaths from all the other cancer types
increase each year. 
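One way to put numbers on these trends is to compare the yearly totals at the start and end of the period. The sketch below does this on a small made-up frame with the same layout as `cancers_df`; the values are illustrative and are not the Gapminder figures.

```python
import pandas as pd

# Hypothetical long frame with the same layout as cancers_df
toy = pd.DataFrame({
    "country": ["A", "B", "A", "B"],
    "year": ["1990", "1990", "2016", "2016"],
    "lung_deaths": [100.0, 50.0, 150.0, 75.0],
    "stomach_deaths": [40.0, 10.0, 41.0, 9.0],
})

# Total deaths per year and cancer type (the text column "country" is skipped)
totals = toy.groupby("year").sum(numeric_only=True)

# Relative change between the first and last year, in percent
change_pct = (totals.loc["2016"] / totals.loc["1990"] - 1) * 100
print(change_pct)
```

Applied to the real data, a value near zero would correspond to a stable cancer type, and a large positive value to a rising one.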

In [None]:
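world = pd.read_csv("files/worldcities.csv")
# Keep columns 2 to 4 (expected to hold the lat, lng and country columns used below),
# keep one row per country and sort by country, without modifying a slice in place
world_df = (
    world.iloc[:, 2:5]
    .drop_duplicates(subset = "country", keep = "first")
    .sort_values(by = "country")
)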

In [None]:
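cancers_2016 = cancers_df[cancers_df.year == "2016"].copy()  # Copy so the new column is not written to a view of cancers_df
# Sum the cancer columns of each row; numeric_only = True skips the text columns
cancers_2016["total_deaths"] = cancers_2016.sum(axis = 1, numeric_only = True)
cancers_2016.head()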

In [None]:
world_map = world_df.merge(cancers_2016, how = "inner", on = "country")
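An inner merge silently drops any country whose name is spelled differently in the two sources. As a hedged check, an outer merge with `indicator=True` lists the names that fail to match. The two frames below are made-up stand-ins for `world_df` and `cancers_2016`.

```python
import pandas as pd

# Hypothetical stand-ins for world_df and cancers_2016
coords = pd.DataFrame({"country": ["A", "B"], "lat": [0.0, 1.0], "lng": [0.0, 1.0]})
deaths = pd.DataFrame({"country": ["A", "C"], "total_deaths": [5.0, 7.0]})

# indicator=True adds a "_merge" column saying which side each row came from
check = coords.merge(deaths, how="outer", on="country", indicator=True)

# Rows found in only one of the two frames would be lost by an inner merge
unmatched = check.loc[check["_merge"] != "both", ["country", "_merge"]]
print(unmatched)
```

Any country listed in `unmatched` would not appear on the map unless its name is harmonised first.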

In [None]:
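# The original tiles = "Mapbox Bright" may not be bundled with recent folium releases;
# "CartoDB positron" is a built-in light basemap used here as a substitute
cancer_map = folium.Map(location = [20,0], tiles = "CartoDB positron", zoom_start = 3)

# Draw one circle per country, with a radius (in metres) proportional to the total deaths
for row in world_map.itertuples():
    folium.Circle(
        location = [row.lat, row.lng],
        popup = row.country,
        radius = row.total_deaths*2,
        color = "purple",
        fill = True,
        fill_color = "purple"
    ).add_to(cancer_map)
cancer_map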
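Because the radius above grows linearly with the death count, the circle area grows with its square, which visually exaggerates the largest countries. A common alternative is to scale the radius with the square root of the value, so that the area is proportional to the count. The helper `scaled_radius` and its default `max_radius_m` are hypothetical choices for illustration, not part of folium.

```python
import math

def scaled_radius(total_deaths, max_deaths, max_radius_m=500_000):
    """Return a radius in metres so that circle area grows linearly with the death count."""
    if max_deaths <= 0:
        raise ValueError("max_deaths must be positive")
    # Square-root scaling: a country with a quarter of the deaths gets half the radius
    return max_radius_m * math.sqrt(max(total_deaths, 0) / max_deaths)

print(scaled_radius(25, 100))
```

In the map loop, this value could replace the `radius` argument, with `max_deaths` set to the largest `total_deaths` in `world_map`.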

<h2 id = "conclusion"> Conclusions </h2>