# Common Cancer Deaths in Men (1990 - 2016)

Table of Contents  

- [Summary](#summary)  
- [Data Wrangling](#wrangling)
- [Exploratory Data Analysis](#eda)
- [Conclusions](#conclusions)


## <a name = "summary"> </a>Introduction

There are five common types of cancer in men: lung, stomach, liver, prostate, and colon and rectum cancer. In 2016, more than **1 million** men worldwide died of one of these cancers.  
Which country is most affected by these cancers? Which cancer kills the most? How have these cancer deaths evolved since 1990?  
In this project we analyse data from [Gapminder World](www.gapminder.org) on these five types
of cancer in men and try to answer these questions.



In [73]:
# Import all the packages needed for the project
import glob                        # Finds the files matching a pattern
from functools import reduce       # Chains the dataframe merges

import pandas as pd                # Dataframes
import numpy as np                 # Arrays
import matplotlib.pyplot as plt    # Plot labels, legends and titles
import matplotlib.style as style   # Plot styles
import folium                      # World map
style.use("fivethirtyeight")       # Style used for all the plots
%matplotlib inline

## <a name="wrangling"></a>Data Wrangling  
First, we import the separate files and gather them into a single dataset.  

### General Properties

In [74]:
## Load all the files
pattern = "files/*deaths.csv"      # Pattern matching every file whose name ends with "deaths.csv"
all_files = glob.glob(pattern)     # List of all the paths that match the pattern
files_names = ["col_rec_df", "stomach_df", "liver_df", "prostate_df", "lung_df"] # Name given to each file
all_dict = {}                      # Empty dictionary to store the names and dataframes
assert len(all_files) == len(files_names) # Check that there are as many files as names
# NOTE: glob.glob() does not guarantee any order, so this positional pairing assumes
# that all_files comes back in the same order as files_names
for name, path in zip(files_names, all_files):
    all_dict[name] = pd.read_csv(path) # Load each file and store it in the dict under its name
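
Because `glob.glob` returns paths in no guaranteed order, pairing files with names by position can silently attach the wrong label to a dataframe. A minimal sketch of one possible alternative is to derive each key from the file name itself. The helper `load_by_stem` is hypothetical and not part of the original notebook, and the resulting keys depend on how the CSV files are actually named:

```python
import glob
from pathlib import Path

import pandas as pd


def load_by_stem(pattern: str) -> dict:
    """Load every CSV matching `pattern`, keyed by its file name without extension."""
    frames = {}
    for path in sorted(glob.glob(pattern)):   # sorted() makes the order reproducible
        frames[Path(path).stem] = pd.read_csv(path)
    return frames
```

With this approach each dataframe is labelled by the file it came from, so the result no longer depends on the order in which the file system lists the files.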


Now that all the files are loaded with a unique name, let us take a look at the structure of each file.

In [75]:
# Check the shape of every dataframe
for name,df in all_dict.items(): 
    print(name, df.shape)       

col_rec_df (187, 28)
stomach_df (187, 28)
liver_df (187, 28)
prostate_df (187, 28)
lung_df (187, 28)


All five dataframes have the same shape: **187** observations and **28** variables.

In [76]:
for name,df in all_dict.items():
    print(name, df.head(1)) 

col_rec_df        country   1990   1991   1992   1993   1994   1995   1996   1997   1998  \
0  Afghanistan  153.0  165.0  181.0  200.0  219.0  237.0  249.0  258.0  265.0   

   ...   2007   2008   2009   2010   2011   2012   2013   2014   2015   2016  
0  ...  403.0  418.0  434.0  452.0  470.0  492.0  516.0  543.0  571.0  599.0  

[1 rows x 28 columns]
stomach_df        country   1990   1991   1992   1993   1994   1995   1996   1997   1998  \
0  Afghanistan  482.0  519.0  568.0  630.0  698.0  758.0  797.0  826.0  850.0   

   ...    2007    2008    2009    2010    2011    2012    2013    2014  \
0  ...  1180.0  1200.0  1210.0  1230.0  1250.0  1280.0  1310.0  1350.0   

     2015    2016  
0  1400.0  1450.0  

[1 rows x 28 columns]
liver_df        country   1990   1991   1992   1993   1994   1995   1996   1997   1998  \
0  Afghanistan  124.0  134.0  146.0  161.0  176.0  191.0  199.0  206.0  211.0   

   ...   2007   2008   2009   2010   2011   2012   2013   2014   2015   2016  
0  ...  

All files share the same structure: a country column followed by one column per year from 1990 to 2016. This wide layout is not optimal for analysis, so let us reshape it from wide to long.

In [84]:
# Transform the files from wide to long
def gather(df: pd.DataFrame, value_name: str) -> pd.DataFrame:
    """Reshape a wide dataframe to long format and name the value column."""
    # One row per (country, year); the values go in a column named "<value_name>deaths"
    return df.melt(id_vars = ["country"], var_name = "year", value_name = value_name + "deaths")

# Use the function gather to create the "long" dataframes
list_df = []                                         # Empty list to hold the new "long" dataframes

for name, wide_df in all_dict.items():               # Loop over all the dataframes in the dictionary
    long_df = gather(wide_df, name[:-2])             # name[:-2] strips the trailing "df", e.g. "lung_df" -> "lung_"
    list_df.append(long_df)                          # Add the new dataframe to the list

# Merge all the dataframes one after another on country and year
df = reduce(lambda left, right: pd.merge(left, right, on = ["country", "year"]), list_df)

df.to_csv("files/common_men_cancers_deaths.csv")     # Write to a csv for future use
df.head()

|   | country | year | col_rec_deaths | stomach_deaths | liver_deaths | prostate_deaths | lung_deaths |
|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1990 | 153.0 | 482.0 | 124.0 | 130.0 | 577.0 |
| 1 | Albania | 1990 | 74.2 | 207.0 | 149.0 | 149.0 | 471.0 |
| 2 | Algeria | 1990 | 260.0 | 650.0 | 215.0 | 261.0 | 1150.0 |
| 3 | Andorra | 1990 | 6.72 | 4.02 | 1.25 | 7.16 | 13.7 |
| 4 | Angola | 1990 | 149.0 | 283.0 | 302.0 | 233.0 | 268.0 |
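
To make the wide-to-long reshape and the chained merge concrete, here is a minimal sketch on two tiny tables. The country names and numbers are made up for illustration and are not Gapminder values:

```python
from functools import reduce

import pandas as pd

# Two tiny wide tables with made-up numbers (not Gapminder values)
lung_wide = pd.DataFrame({"country": ["A", "B"], "1990": [10.0, 20.0], "1991": [11.0, 21.0]})
liver_wide = pd.DataFrame({"country": ["A", "B"], "1990": [1.0, 2.0], "1991": [1.5, 2.5]})

# melt() turns each year column into one (year, value) row per country
lung_long = lung_wide.melt(id_vars=["country"], var_name="year", value_name="lung_deaths")
liver_long = liver_wide.melt(id_vars=["country"], var_name="year", value_name="liver_deaths")

# Merging on the shared keys puts one column per cancer type side by side
toy = reduce(lambda left, right: pd.merge(left, right, on=["country", "year"]),
             [lung_long, liver_long])
print(toy)
```

Each wide table with 2 countries and 2 year columns becomes a long table with 4 rows, and the merge keeps one row per country and year with one value column per cancer type.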


In [85]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5049 entries, 0 to 5048
Data columns (total 7 columns):
country            5049 non-null object
year               5049 non-null object
col_rec_deaths     5049 non-null float64
stomach_deaths     5049 non-null float64
liver_deaths       5049 non-null float64
prostate_deaths    5049 non-null float64
lung_deaths        5049 non-null float64
dtypes: float64(5), object(2)
memory usage: 315.6+ KB


After merging all the files, we have a dataset with 5049 rows and 7 columns. There are no missing values, since
every column has 5049 non-null entries, but the year variable is stored as a string, which is not correct. In the next section we clean the data.  

### Cleaning Data 

We start cleaning the data by converting the year variable from a string to a numeric type, specifically an integer.

In [87]:
df["year"] = pd.to_numeric(df["year"])        # Convert the year variable from string to int64
assert df["year"].dtype == np.dtype("int64")  # Check that the type is correct; no error means we are fine

Since no error is raised, we know that the year variable was correctly converted from string to integer.

In [88]:
assert df.notnull().all().all()          # Confirm that there are no missing values
assert (df.iloc[:, 1:] > 0).all().all()  # Confirm that all values (year and deaths) are greater than zero

At this point we are sure that the data has no missing values and no zero values, and that all the variables have
the correct type.
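
A bare `assert` only says that something failed, not what. As a possible complement, the sketch below collects the same three checks into one function that reports which columns are at fault. The helper name `check_clean` is hypothetical and not part of the original notebook:

```python
import pandas as pd


def check_clean(frame: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means the frame passed."""
    problems = []
    # Columns that contain at least one missing value
    null_cols = [col for col in frame.columns if frame[col].isnull().any()]
    if null_cols:
        problems.append(f"missing values in: {null_cols}")
    # The year column should hold integers, not strings
    if not pd.api.types.is_integer_dtype(frame["year"]):
        problems.append(f"year has dtype {frame['year'].dtype}, expected an integer type")
    # Numeric columns that contain a zero or negative value
    numeric = frame.select_dtypes(include="number")
    non_positive = [col for col in numeric.columns if (numeric[col] <= 0).any()]
    if non_positive:
        problems.append(f"non-positive values in: {non_positive}")
    return problems
```

Calling `check_clean(df)` on the cleaned dataset should return an empty list if the assertions above pass.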


In [None]:
cancers_df.describe()

In [None]:
# Select the row with the highest number of lung cancer deaths and plot it as a bar chart
cancers_max_plt = cancers_df[cancers_df.lung_deaths == cancers_df.lung_deaths.max()].plot(kind = "bar", x = "year", figsize = (10,8))
plt.xlabel("Year", fontsize = 14)
plt.ylabel("Number of deaths", fontsize = 14)
plt.legend(fontsize = 18)
plt.title("China, Cancer deaths in 2016", fontsize = 16);

In 2016, **424000** men died of lung cancer in China alone, which is almost **12** times the number of prostate cancer deaths there and close to the entire population of the city of Minneapolis (USA).
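
A comparison of this kind can be computed directly from the dataframe rather than read off the chart. The following is a minimal sketch on made-up numbers (not Gapminder values), assuming columns named `lung_deaths` and `prostate_deaths` as in the dataset above:

```python
import pandas as pd

# Made-up numbers for illustration only (not Gapminder values)
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [2016, 2016, 2016],
    "lung_deaths": [120.0, 300.0, 80.0],
    "prostate_deaths": [40.0, 25.0, 60.0],
})

top = toy.loc[toy["lung_deaths"].idxmax()]            # Row with the most lung cancer deaths
ratio = top["lung_deaths"] / top["prostate_deaths"]   # Lung deaths relative to prostate deaths
print(top["country"], round(ratio, 1))
```

`idxmax()` returns the index label of the largest value, so `toy.loc[...]` yields a single row from which any pair of columns can be compared.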

In [None]:
# Main plot: group cancers_df by year, then sum to get the total number of deaths per cancer type
cancers_trends = cancers_df.groupby("year").sum(numeric_only = True).plot(legend = False, color = ["red", "brown", "blue", "green"], figsize = (10,8))
# Add a title
cancers_trends.text(x = 1, y = 1350000, s = "Men Cancers Deaths Trends", fontsize = 18, weight = "bold")
# Add a subtitle
cancers_trends.text(x = 1, y = 1250000, s = "Evolution of the number of deaths from 1990 to 2016\ndue to the 4 most common cancers in Men",
                    fontsize = 16, alpha = 0.7)
# Label each line directly on the plot instead of using a legend
cancers_trends.text(x=10, y = 1000000, s = "lung cancer", fontsize = 14, rotation = 20, color = "brown")
cancers_trends.text(x=5, y = 520000, s = "stomach cancer",fontsize = 14, color = "blue")
cancers_trends.text(x=1, y = 400000, s = "liver cancer", fontsize = 14, rotation = 14, color = "red")
cancers_trends.text(x=12, y = 320000, s = "prostate cancer", fontsize = 14, rotation = 6, color = "green")
# Adds a signature bar
cancers_trends.text(x = 1, y = 50000, s = "(c)Furawa                                                                                                Source: Gapminder World",
                   fontsize = 12, backgroundcolor = "grey", color = "white");

Lung cancer is the type that kills the most people in every year. In 2016, more than 1 million men worldwide died of lung cancer.
Except for stomach cancer deaths, which are quite stable, deaths from all the other types of cancer
increase each year. 
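
Statements such as "quite stable" or "increase each year" can be backed with numbers by comparing the yearly world totals. The following is a minimal sketch on made-up data (not Gapminder values), assuming the same long layout with one row per country and year:

```python
import pandas as pd

# Made-up numbers for illustration only (not Gapminder values)
toy = pd.DataFrame({
    "country": ["A", "B", "A", "B"],
    "year": [1990, 1990, 2016, 2016],
    "lung_deaths": [100.0, 200.0, 150.0, 300.0],
    "stomach_deaths": [50.0, 50.0, 52.0, 48.0],
})

# Sum over countries to get one world total per year and cancer type
totals = toy.groupby("year")[["lung_deaths", "stomach_deaths"]].sum()

# Relative change between the first and the last year, in percent
change_pct = (totals.iloc[-1] / totals.iloc[0] - 1) * 100
print(change_pct.round(1))
```

A value near zero indicates a stable total, while a clearly positive value indicates growth over the period.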

In [None]:
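world = pd.read_csv("files/worldcities.csv")
world_df = world.iloc[:, 2:5].copy()                                     # Keep the columns at positions 2 to 4 as an independent copy
world_df = world_df.drop_duplicates(subset = "country", keep = "first")  # Keep only the first row of each country
world_df = world_df.sort_values(by = "country")                          # Sort alphabetically by country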

In [None]:
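# Assumes year is stored as an integer (as after the cleaning step), so compare with 2016 and not "2016"
cancers_2016 = cancers_df[cancers_df.year == 2016].copy()
# Sum only the death columns, so that the year is not added to the total
cancers_2016["total_deaths"] = cancers_2016.filter(like = "_deaths").sum(axis = 1)
cancers_2016.head()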

In [None]:
world_map = world_df.merge(cancers_2016, how = "inner", on = "country")
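
An inner merge silently drops every country whose name is spelled differently in the two tables. One possible way to see what would be lost is an outer merge with `indicator=True`. The two miniature tables below are hypothetical stand-ins with made-up names and values:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables; names and values are made up
coords = pd.DataFrame({"country": ["A", "B"], "lat": [1.0, 2.0], "lng": [3.0, 4.0]})
deaths = pd.DataFrame({"country": ["A", "C"], "total_deaths": [10.0, 20.0]})

# An outer merge with indicator=True labels each row by where its key was found
check = coords.merge(deaths, how="outer", on="country", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["country", "_merge"]]
print(unmatched)
```

Rows marked `left_only` have coordinates but no death counts, and rows marked `right_only` have death counts but no coordinates, so neither would appear on the map.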

In [None]:
# World map centred on latitude 20, longitude 0
cancer_map = folium.Map(location = [20,0], tiles = "Mapbox Bright", zoom_start = 3)

# Draw one circle per country, with a radius proportional to its total number of deaths
for _, row in world_map.iterrows():
    folium.Circle(
        location = [row["lat"], row["lng"]],
        popup = row["country"],
        radius = row["total_deaths"]*2,
        color = "purple",
        fill = True,
        fill_color = "purple"
    ).add_to(cancer_map)
cancer_map

## <a name="conclusions"></a>Conclusions