# Lung Cancer Risk in 25 Countries

## Introduction 
This dataset contains information on lung cancer risk factors across various countries, focusing on demographic details, smoking behaviors, and family history. This dataset can be used to study patterns of lung cancer incidence, identify trends related to smoking and passive smoking exposure, and assess the impact of family history on lung cancer risk.

### Purpose of analysis
- **Risk Factor Analysis**: Analyze how smoking habits, exposure to secondhand smoke, and family history correlate with lung cancer risk.  
- **Demographic Insights**: Explore how age and gender impact the prevalence of lung cancer risk factors.  
- **Regional Variations**: Compare lung cancer risk factors across different countries and regions.  
- **Public Health Research**: Identify populations with high-risk behaviors and suggest interventions or preventive measures.

### Data Dictionary

| Column Name                      | Description                                                                 |
|----------------------------------|-----------------------------------------------------------------------------|
| `ID`                             | Unique identifier for each record                                           |
| `Country`                        | Name of the country                                                         |
| `Population_Size`               | Total population size of the country (in millions)                         |
| `Age`                            | Age of the individual (in years)                                            |
| `Gender`                         | Gender of the individual (`Male` / `Female`)                                |
| `Smoker`                         | Whether the individual is a smoker (`Yes` / `No`)                           |
| `Years_of_Smoking`              | Number of years the individual has been smoking                             |
| `Cigarettes_per_Day`           | Average number of cigarettes smoked per day                                 |
| `Passive_Smoker`                | Whether the individual is regularly exposed to secondhand smoke             |
| `Family_History`                | Whether there is a family history of lung cancer (`Yes` / `No`)             |
| `Chronic_Lung_Disease`         | Presence of pre-existing chronic lung conditions (`Yes` / `No`)             |
| `Genetic_Mutation`             | Whether the individual has genetic mutations linked to cancer risk (`Yes` / `No`) |
| `Radiation_Exposure`           | Exposure to radiation (`Yes` / `No`)                                        |
| `Air_Pollution_Exposure`       | Level of exposure to air pollution (`Low` / `Medium` / `High`)              |
| `Occupational_Exposure`        | Whether the person is exposed to carcinogens at work (`Yes` / `No`)         |
| `Indoor_Pollution`             | Whether the individual is exposed to indoor pollutants (`Yes` / `No`)       |
| `Healthcare_Access`            | Quality of healthcare access (`Good` / `Moderate` / `Poor`)                 |
| `Early_Detection`              | Whether lung cancer was detected early (`Yes` / `No`)                       |
| `Treatment_Type`               | Type of treatment received (`None`, `Surgery`, `Radiation`, etc.)           |
| `Developed_or_Developing`      | Economic status of the country (`Developed` / `Developing`)                 |
| `Annual_Lung_Cancer_Deaths`    | Number of deaths from lung cancer per year in the country                   |
| `Lung_Cancer_Prevalence_Rate` | Percentage of population diagnosed with lung cancer                         |
| `Mortality_Rate`               | Mortality rate due to lung cancer (possibly normalized or % value)          |


## Import Packages

In [1]:
import pandas as pd
import numpy as np 
pd.set_option('display.max_columns', None)

import matplotlib.pyplot as plt
import seaborn as sns

import warnings 
warnings.filterwarnings('ignore')

## Import Data

In [2]:
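lcp = pd.read_csv('lung_cancer_prediction_dataset.csv', index_col='ID')
df = lcp.copy()
display(df.head())

print('Number of Rows and Columns:', df.shape)

# df.info() prints its own report and returns None, so call it directly
print('Dataset Info:')
df.info()
print('\n')

# display() also returns None, so print the label first and then show the table
print('Summary Statistics:')
display(df.describe().round())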


ID,Country,Population_Size,Age,Gender,Smoker,Years_of_Smoking,Cigarettes_per_Day,Passive_Smoker,Family_History,Lung_Cancer_Diagnosis,Cancer_Stage,Survival_Years,Adenocarcinoma_Type,Air_Pollution_Exposure,Occupational_Exposure,Indoor_Pollution,Healthcare_Access,Early_Detection,Treatment_Type,Developed_or_Developing,Annual_Lung_Cancer_Deaths,Lung_Cancer_Prevalence_Rate,Mortality_Rate
0,China,1400,80,Male,Yes,30,29,No,No,No,,0,Yes,Low,Yes,No,Poor,No,,Developing,690000,2.44,0.0
1,Iran,84,53,Male,No,0,0,Yes,No,No,,0,Yes,Low,Yes,No,Poor,No,,Developing,27000,2.1,0.0
2,Mexico,128,47,Male,Yes,12,6,Yes,No,No,,0,Yes,Medium,No,No,Poor,Yes,,Developing,28000,1.11,0.0
3,Indonesia,273,39,Female,No,0,0,No,Yes,No,,0,Yes,Low,No,No,Poor,No,,Developing,40000,0.75,0.0
4,South Africa,59,44,Female,No,0,0,Yes,No,No,,0,Yes,Medium,Yes,No,Poor,No,,Developing,15000,2.44,0.0


Number of Rows and Columns (220632, 23)
<class 'pandas.core.frame.DataFrame'>
Index: 220632 entries, 0 to 220631
Data columns (total 23 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Country                      220632 non-null  object 
 1   Population_Size              220632 non-null  int64  
 2   Age                          220632 non-null  int64  
 3   Gender                       220632 non-null  object 
 4   Smoker                       220632 non-null  object 
 5   Years_of_Smoking             220632 non-null  int64  
 6   Cigarettes_per_Day           220632 non-null  int64  
 7   Passive_Smoker               220632 non-null  object 
 8   Family_History               220632 non-null  object 
 9   Lung_Cancer_Diagnosis        220632 non-null  object 
 10  Cancer_Stage                 8961 non-null    object 
 11  Survival_Years               220632 non-null  int64  
 12  Adenocarcinoma_Type    

,Population_Size,Age,Years_of_Smoking,Cigarettes_per_Day,Survival_Years,Annual_Lung_Cancer_Deaths,Lung_Cancer_Prevalence_Rate,Mortality_Rate
count,220632.0,220632.0,220632.0,220632.0,220632.0,220632.0,220632.0,220632.0
mean,230.0,53.0,8.0,7.0,0.0,63931.0,2.0,3.0
std,349.0,19.0,12.0,10.0,1.0,130690.0,1.0,15.0
min,54.0,20.0,0.0,0.0,0.0,10005.0,0.0,0.0
25%,83.0,36.0,0.0,0.0,0.0,23000.0,1.0,0.0
50%,113.0,53.0,0.0,0.0,0.0,30000.0,2.0,0.0
75%,206.0,69.0,15.0,14.0,0.0,45000.0,2.0,0.0
max,1400.0,85.0,40.0,30.0,10.0,690000.0,2.0,90.0


Summary Statistics: None


In [3]:
df.columns

Index(['Country', 'Population_Size', 'Age', 'Gender', 'Smoker',
       'Years_of_Smoking', 'Cigarettes_per_Day', 'Passive_Smoker',
       'Family_History', 'Lung_Cancer_Diagnosis', 'Cancer_Stage',
       'Survival_Years', 'Adenocarcinoma_Type', 'Air_Pollution_Exposure',
       'Occupational_Exposure', 'Indoor_Pollution', 'Healthcare_Access',
       'Early_Detection', 'Treatment_Type', 'Developed_or_Developing',
       'Annual_Lung_Cancer_Deaths', 'Lung_Cancer_Prevalence_Rate',
       'Mortality_Rate'],
      dtype='object')
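
The Data Dictionary and the columns actually loaded do not line up exactly. A quick set comparison makes the gaps explicit. The sketch below hard-codes both lists as they appear in this notebook (`documented` from the Data Dictionary table, `actual` from the `df.columns` output); in the notebook itself, `actual` could simply be `set(df.columns)`. The variable names are illustrative only.

```python
# Column names listed in the Data Dictionary (ID is the index, so it is left out)
documented = {
    'Country', 'Population_Size', 'Age', 'Gender', 'Smoker',
    'Years_of_Smoking', 'Cigarettes_per_Day', 'Passive_Smoker',
    'Family_History', 'Chronic_Lung_Disease', 'Genetic_Mutation',
    'Radiation_Exposure', 'Air_Pollution_Exposure', 'Occupational_Exposure',
    'Indoor_Pollution', 'Healthcare_Access', 'Early_Detection',
    'Treatment_Type', 'Developed_or_Developing', 'Annual_Lung_Cancer_Deaths',
    'Lung_Cancer_Prevalence_Rate', 'Mortality_Rate',
}

# Column names reported by df.columns
actual = {
    'Country', 'Population_Size', 'Age', 'Gender', 'Smoker',
    'Years_of_Smoking', 'Cigarettes_per_Day', 'Passive_Smoker',
    'Family_History', 'Lung_Cancer_Diagnosis', 'Cancer_Stage',
    'Survival_Years', 'Adenocarcinoma_Type', 'Air_Pollution_Exposure',
    'Occupational_Exposure', 'Indoor_Pollution', 'Healthcare_Access',
    'Early_Detection', 'Treatment_Type', 'Developed_or_Developing',
    'Annual_Lung_Cancer_Deaths', 'Lung_Cancer_Prevalence_Rate',
    'Mortality_Rate',
}

# Documented but absent from the data
missing_from_data = sorted(documented - actual)
# Present in the data but not documented
undocumented = sorted(actual - documented)

print('Documented but not in data:', missing_from_data)
print('In data but not documented:', undocumented)
```

Columns in the second list have no description in the dictionary, so their meaning has to be inferred from their names and values.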

## Cleaning Data

In [4]:
print('Number of Nulls:\n', df.isnull().sum(), sep='')

Number of Nulls:
Country                             0
Population_Size                     0
Age                                 0
Gender                              0
Smoker                              0
Years_of_Smoking                    0
Cigarettes_per_Day                  0
Passive_Smoker                      0
Family_History                      0
Lung_Cancer_Diagnosis               0
Cancer_Stage                   211671
Survival_Years                      0
Adenocarcinoma_Type                 0
Air_Pollution_Exposure              0
Occupational_Exposure               0
Indoor_Pollution                    0
Healthcare_Access                   0
Early_Detection                     0
Treatment_Type                 213968
Developed_or_Developing             0
Annual_Lung_Cancer_Deaths           0
Lung_Cancer_Prevalence_Rate         0
Mortality_Rate                      0
dtype: int64
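
`Cancer_Stage` has 8,961 non-null values out of 220,632 rows (211,671 nulls), which suggests, but does not prove, that the stage is only recorded for diagnosed individuals. The sketch below shows one way to check that assumption. It runs on a small toy frame with made-up stage labels; in the notebook, `toy` would be replaced by `df` before the `Yes`/`No` columns are mapped to 1/0.

```python
import pandas as pd

# Toy stand-in for df; the stage labels here are invented for illustration
toy = pd.DataFrame({
    'Lung_Cancer_Diagnosis': ['No', 'Yes', 'No', 'Yes'],
    'Cancer_Stage': [None, 'Stage 2', None, 'Stage 1'],
})

stage_missing = toy['Cancer_Stage'].isna()
undiagnosed = toy['Lung_Cancer_Diagnosis'] == 'No'

# Cross-tabulate diagnosis against missingness of the stage
print(pd.crosstab(toy['Lung_Cancer_Diagnosis'], stage_missing))

# True only if the stage is missing exactly for the undiagnosed rows
missing_only_when_undiagnosed = bool((stage_missing == undiagnosed).all())
print(missing_only_when_undiagnosed)
```

If the check holds on the real data, filling the nulls with a label such as `'No info'` is a labelling choice rather than an imputation.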


In [5]:
# Identify duplicates and count them
lcp.duplicated().sum()

In [6]:
# Identify duplicates in Index and count them
lcp.index.duplicated().sum()

#### There are no duplicate ID values

In [7]:
# Show duplicates 
dup = lcp[lcp.duplicated(keep = False)]
dup.sample(6)

In [8]:
# Cleaning columns

# Label missing stage/treatment values explicitly instead of leaving NaN
df['Cancer_Stage'] = df['Cancer_Stage'].fillna('No info')
df['Treatment_Type'] = df['Treatment_Type'].fillna('No info')

# Encode the Yes/No columns as 1/0 integers
map_dict = {'Yes': 1, 'No': 0}
yes_no_columns = ['Lung_Cancer_Diagnosis', 'Smoker', 'Passive_Smoker',
                  'Family_History', 'Early_Detection', 'Adenocarcinoma_Type']
for col in yes_no_columns:
    df[col] = df[col].map(map_dict)

# Add a 1/0 indicator column derived from Gender
df['is_male'] = (df['Gender'] == 'Male').astype(int)
df.sample(3)

In [9]:
# Look for inconsistent entries: years of smoking should not exceed age
df['start_smoking_age'] = df['Age'] - df['Years_of_Smoking']
print(df['start_smoking_age'].unique())

# Keep only rows with a non-negative starting age
df_smoking = df[df['start_smoking_age'] >= 0]
print('Start smoking age', sorted(df_smoking['start_smoking_age'].unique()))
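
The filter above keeps the consistent rows; it can also help to look at the rows it drops. A minimal sketch on a toy frame (the values are invented; in the notebook `toy` would be `df`):

```python
import pandas as pd

# Toy stand-in for df with one impossible record (30 years of smoking at age 25)
toy = pd.DataFrame({
    'Age': [80, 25, 40],
    'Years_of_Smoking': [30, 30, 0],
})

toy['start_smoking_age'] = toy['Age'] - toy['Years_of_Smoking']

# Rows where the implied starting age is negative
invalid = toy[toy['start_smoking_age'] < 0]
print(len(invalid), 'inconsistent row(s)')
print(invalid)
```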

## Feature Engineering 

In [10]:
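# Annual lung cancer deaths as a percentage of the national population
# (Population_Size is in millions, hence the factor of 1,000,000)
df['Lung_Cancer_Death_rate_in_population'] = df['Annual_Lung_Cancer_Deaths'] * 100 / (df['Population_Size'] * 1_000_000)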
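
As a worked example of this formula, the first row of the dataset (China, `Population_Size` 1400, `Annual_Lung_Cancer_Deaths` 690000) can be computed by hand. The helper name `death_rate_percent` is hypothetical and only used here for illustration.

```python
def death_rate_percent(annual_deaths, population_millions):
    """Annual deaths as a percentage of a population given in millions."""
    return annual_deaths * 100 / (population_millions * 1_000_000)

# Values taken from the first row of df.head()
china = death_rate_percent(690_000, 1400)
print(round(china, 4))  # 0.0493
```

So roughly 0.05% of the population per year, which matches the scale of the y-axis used for this column later in the notebook.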

In [11]:
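# Categorise smokers relative to the dataset-wide means
# (the means are taken over all rows, including non-smokers whose values are 0)

mean_years = df['Years_of_Smoking'].mean()
mean_cigs = df['Cigarettes_per_Day'].mean()


def smoker_type(year, cigarette):
    '''Label a record as non-smoker, light, heavy, long-term, or long-term and heavy smoker.'''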
    
    if cigarette == 0:
        return 'Non Smoker'
    
    else:
        if year > mean_years and cigarette > mean_cigs:
            return 'Longterm and Heavy Smoker'
        elif year > mean_years:
            return 'Longterm Smoker'
        elif cigarette > mean_cigs:
            return 'Heavy Smoker'
        else:
            return 'Light Smoker'
    
df['Smoker_Type'] = df.apply(lambda row: smoker_type(row['Years_of_Smoking'], row['Cigarettes_per_Day']), axis=1)
df.head(3)
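
Row-wise `apply` works but can be slow on a frame of this size. One possible alternative, shown here as a sketch rather than a drop-in replacement, is `np.select`, which evaluates the same conditions column-wise and picks the first one that matches. The toy frame below is invented; the category labels are the ones used by `smoker_type`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df
toy = pd.DataFrame({
    'Years_of_Smoking':   [0, 30, 30, 2, 2],
    'Cigarettes_per_Day': [0, 29, 3, 25, 3],
})

mean_years = toy['Years_of_Smoking'].mean()
mean_cigs = toy['Cigarettes_per_Day'].mean()

long_term = toy['Years_of_Smoking'] > mean_years
heavy = toy['Cigarettes_per_Day'] > mean_cigs

# Conditions are checked in order, mirroring the if/elif chain in smoker_type
conditions = [
    toy['Cigarettes_per_Day'] == 0,
    long_term & heavy,
    long_term,
    heavy,
]
choices = [
    'Non Smoker',
    'Longterm and Heavy Smoker',
    'Longterm Smoker',
    'Heavy Smoker',
]

toy['Smoker_Type'] = np.select(conditions, choices, default='Light Smoker')
print(toy)
```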


## Exploratory Data Analysis/Visualization
This analysis includes:
1. Lung Cancer Diagnosis & Correlations of numeric values
2. Lung cancer diagnosis breakdown by gender/ smoker/ smoker types
3. Lung Cancer Diagnosis by Family History, passive smoker, adenocarcinoma type
4. Lung cancer patients categorised by smoker type for each air pollution exposure split on genders
5. Regional Variations in Lung cancer diagnosis:
   - Lung cancer death rate in population of each country
   - Lung cancer prevalence rate against Developed or Developing Country
   - Pairplot of the per-country numeric columns
   - Mortality rate vs. lung cancer prevalence rate by country
   - Distribution of average mortality rate across countries
   - Top 10 countries by average mortality rate
   - Lung cancer cases in Ethiopia
   - Lung cancer cases in Nigeria    



In [12]:
df.select_dtypes(include='number').corr()

In [13]:
#Correlations of numeric columns

plt.figure(figsize=(6, 5))
numeric_corr = df[['Age', 'Years_of_Smoking', 'Cigarettes_per_Day', 'Survival_Years',
                   'Annual_Lung_Cancer_Deaths', 'Lung_Cancer_Prevalence_Rate', 'Mortality_Rate']].corr()

# annot=True is required for fmt to have any effect
sns.heatmap(numeric_corr, cmap='coolwarm', annot=True, fmt=".2f")
plt.title('Correlation Heatmap of Numeric Features')
plt.tight_layout()
plt.show()


In [14]:
#Split of Cancer and No cancer data

display(df['Lung_Cancer_Diagnosis'].value_counts())

plt.figure(figsize=(3,3))
plt.pie(df['Lung_Cancer_Diagnosis'].value_counts(), labels=df['Lung_Cancer_Diagnosis'].value_counts().index,
       autopct='%1.1f%%', startangle=90, colors=['#107082', '#F0CDA1'], textprops={'fontsize': 12, 'color': 'black'} )
plt.legend(title='Diagnosis', labels=['No Cancer', 'Cancer'], fontsize=12, bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title('Split of Cancer and No Cancer')
plt.show()

In [15]:
# Split lung cancer diagnosis by Gender
df['Lung_Cancer_Diagnosis_str'] = df['Lung_Cancer_Diagnosis'].astype(str)

plt.figure(figsize=(3, 4))
# Fix the hue order so the legend labels below match the bars ('0' = no cancer, '1' = cancer)
sns.countplot(data=df, x='Gender', hue='Lung_Cancer_Diagnosis_str', hue_order=['0', '1'],
              palette=['#107082', '#F0CDA1'])

table = pd.crosstab(df['Gender'], df['Lung_Cancer_Diagnosis_str'], margins=True, margins_name="Total")
display(table)
plt.title('Lung Cancer Diagnosis by Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.legend(title='Diagnosis', labels=['No cancer', 'Cancer'], fontsize=12, bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
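
Raw counts by gender are hard to compare when the groups differ in size. `pd.crosstab` can report row-wise proportions instead via `normalize='index'`. A small sketch on invented data (in the notebook, `toy` would be `df`):

```python
import pandas as pd

# Toy stand-in for df: 1 of 4 males and 2 of 4 females are diagnosed
toy = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Male', 'Male',
               'Female', 'Female', 'Female', 'Female'],
    'Lung_Cancer_Diagnosis': [1, 0, 0, 0, 1, 1, 0, 0],
})

# Each row sums to 1, so column 1 is the diagnosis rate within each gender
rates = pd.crosstab(toy['Gender'], toy['Lung_Cancer_Diagnosis'], normalize='index')
print(rates)
```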

In [16]:
# Cases of Cancer patients by Gender

lung_diagnosis = df[df['Lung_Cancer_Diagnosis']== 1]

plt.figure(figsize=(3,3))

plt.pie(lung_diagnosis['Gender'].value_counts(), labels=lung_diagnosis['Gender'].value_counts().index,
       autopct='%1.1f%%', startangle=90, colors=['#107082', '#F0CDA1'])
plt.show()

In [17]:
# Cases of Cancer patients by Smoker

lung_diagnosis = df[df['Lung_Cancer_Diagnosis']== 1]
plt.figure(figsize=(3,3))

# Label the slices from the data itself so the labels cannot drift from the counts
smoker_counts = lung_diagnosis['Smoker'].map({1: 'Yes', 0: 'No'}).value_counts()
plt.pie(smoker_counts, labels=smoker_counts.index,
       autopct='%1.1f%%', startangle=90, colors=['#107082', '#F0CDA1'])
plt.legend(title='Smoker', loc='upper right', bbox_to_anchor=(1.3, 1))
plt.show()

In [18]:
# Lung cancer diagnosis by smoker types and gender

grouped_data = lung_diagnosis.groupby(['Gender', 'Smoker_Type']).size().unstack(level=[1])

ax = grouped_data.plot(kind='bar', figsize=(6,5), color=['#E68C14','#A3A3A3','#F0CDA1', '#107082', '#F3E7B3']) 

ax.set_title('Smoker Types by Gender Among Lung Cancer Patients')

ax.set_ylabel('Cases')
ax.legend(title='Smoker Type', bbox_to_anchor=(1.05, 1), loc='upper left')
ax.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [19]:
# Adenocarcinoma type broken down by passive smoking, family history, and gender
main_column = 'Adenocarcinoma_Type'
comparison_columns = ['Passive_Smoker', 'Family_History', 'Gender']

for col in comparison_columns:
    group_counts = df.groupby([main_column, col]).size().reset_index(name='count')
    
    pivot_data = group_counts.pivot(index=main_column, columns=col, values='count')
    
    # Plotting the stacked bar plot
    pivot_data.plot(kind='barh', stacked=True, figsize=(2,1), color = ['#107082', '#F0CDA1'])
    plt.title(f'{main_column} vs {col}')
    plt.xlabel('Cases')
    plt.ylabel('Adenocarcinoma')
    plt.legend(title=col, loc='upper right', bbox_to_anchor=(2, 1))
    plt.xticks(rotation=45)
    plt.show()
    

In [20]:
# Years of smoking among lung cancer patients for each air pollution exposure level, split by gender

sns.violinplot(data=lung_diagnosis, x="Air_Pollution_Exposure", y="Years_of_Smoking", hue="Gender", 
               split=True,  palette=['#107082', '#F0CDA1'], inner="quart")
plt.title('Years of Smoking by Air Pollution Exposure, Split by Gender')
plt.tight_layout()
plt.legend(loc='upper right',  bbox_to_anchor = (1.2,1))
plt.xticks(rotation=0)
plt.show()

In [21]:
# Split of air pollution exposures

plt.figure(figsize=(3,3))

plt.pie(lung_diagnosis['Air_Pollution_Exposure'].value_counts(), labels=lung_diagnosis['Air_Pollution_Exposure'].value_counts().index,
       autopct='%1.1f%%', startangle=90, colors=['#107082', '#F0CDA1', '#A3A3A3'])
plt.show()

**Regional Variations in Lung cancer diagnosis**

In [22]:
# Lung cancer diagnosis by country
diagnosis_by_country = df.groupby(['Country','Developed_or_Developing', 'Lung_Cancer_Diagnosis']).size().unstack(fill_value=0)

# Sort by total count
diagnosis_by_country = diagnosis_by_country.loc[diagnosis_by_country.sum(axis=1).sort_values(ascending=False).index].reset_index()
display(diagnosis_by_country)

# Plot stacked bar chart (index by Country so the x-axis shows country names, not row numbers)
diagnosis_by_country.set_index('Country')[[0, 1]].plot(kind='bar', stacked=True, figsize=(12,5), color=['#107082', '#F0CDA1'])
plt.title('Lung Cancer Diagnosis Counts by Country')
plt.xlabel('Country')
plt.ylabel('Number of Cases')
plt.legend(title = 'Diagnosis', loc= 'lower right', labels=[ 'No cancer', 'Cancer'], bbox_to_anchor = (0.8,1))
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [23]:
# Combine the per-country diagnosis counts with per-country means of the numeric columns
country_numeric = df.groupby('Country')[['Population_Size','Smoker','Years_of_Smoking','Survival_Years',
                                 'Lung_Cancer_Prevalence_Rate','Annual_Lung_Cancer_Deaths', 'Mortality_Rate','Lung_Cancer_Death_rate_in_population']].mean().reset_index()

display(diagnosis_by_country.head(3))

country = diagnosis_by_country.merge(country_numeric, on='Country', how='outer')
pd.set_option('display.max_columns', None)
country.head()
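
Because both frames are expected to hold exactly one row per country, the merge can be made to verify that itself. The sketch below uses two pandas `merge` options, `validate` and `indicator`, on invented frames; the column names `no_cancer` and `cancer` and the country labels are placeholders, not the notebook's actual columns.

```python
import pandas as pd

# Toy stand-ins for diagnosis_by_country and country_numeric
counts = pd.DataFrame({'Country': ['A', 'B'],
                       'no_cancer': [90, 80],
                       'cancer': [10, 20]})
means = pd.DataFrame({'Country': ['A', 'B'],
                      'Mortality_Rate': [3.0, 4.0]})

# validate raises MergeError if either side has repeated countries;
# indicator adds a _merge column showing where each row came from
merged = counts.merge(means, on='Country', how='outer',
                      validate='one_to_one', indicator=True)

all_matched = bool((merged['_merge'] == 'both').all())
print(merged)
print('Every country present in both frames:', all_matched)
```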


In [26]:
# Find countries that report more than one distinct value of Annual_Lung_Cancer_Deaths
nodup_countries = []
for countries in df['Country'].unique():
    # Number of distinct annual-death values recorded for this country
    n_values = df.loc[df['Country'] == countries, 'Annual_Lung_Cancer_Deaths'].nunique()

    if n_values > 1:
        nodup_countries.append(countries)

In [27]:
nodup_countries
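
The same check can be written without an explicit loop using `groupby(...).nunique()`. This is a sketch on an invented frame; the death counts below are made up and are not the dataset's actual values.

```python
import pandas as pd

# Toy stand-in for df; the numbers are illustrative only
toy = pd.DataFrame({
    'Country': ['China', 'China', 'Myanmar', 'Myanmar', 'Iran'],
    'Annual_Lung_Cancer_Deaths': [100, 100, 20, 30, 50],
})

# Count distinct values per country, then keep countries with more than one
n_values = toy.groupby('Country')['Annual_Lung_Cancer_Deaths'].nunique()
inconsistent = n_values[n_values > 1].index.tolist()
print(inconsistent)  # ['Myanmar']
```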

In [28]:
# Lung cancer death rate in population of each country
color_map = {'Developed': '#107082', 'Developing': '#E68C14'} 

pop_size = country['Population_Size']/3
plt.figure(figsize=(9, 6))
scatter = plt.scatter(country['Country'], country['Lung_Cancer_Death_rate_in_population'], s= pop_size, c=country['Developed_or_Developing'].map(color_map), alpha=0.6, edgecolors="w", linewidth=3)
plt.title('Lung cancer death rate in population of each country')
plt.xlabel('Country')
plt.ylabel('Lung_Cancer_Death_rate_in_population')
plt.xticks(rotation=60)

handles = [
    plt.Line2D([0], [0], marker='o', color='w', label='Developed',
               markerfacecolor='#107082', markersize=10),
    plt.Line2D([0], [0], marker='o', color='w', label='Developing',
               markerfacecolor='#E68C14', markersize=10),
]

plt.legend(handles=handles, bbox_to_anchor = (1.25,1))


plt.grid(True, linestyle='--', linewidth=0.5, alpha=0.7)

plt.annotate('Myanmar Reports Multiple \n Annual Cancer Death Values', xy=(14, 0.065), xytext=(17, 0.057),arrowprops=dict(facecolor='red', arrowstyle='->'),  
    fontsize=8, color='red',
    horizontalalignment='center',
    verticalalignment='bottom')
plt.tight_layout()
plt.show()

In [29]:
#Lung cancer prevalence rate against Developed or Developing Country

developed = df[df['Developed_or_Developing'] == 'Developed']
developing = df[df['Developed_or_Developing'] == 'Developing']
display (developed.head(2))
plt.figure(figsize=(7, 5))
plt.hist(developed['Lung_Cancer_Prevalence_Rate'], bins=30, alpha=0.6,
         color='#107082', edgecolor='black', label='Developed')
plt.hist(developing['Lung_Cancer_Prevalence_Rate'], bins=30, alpha=0.6,
         color ='#F0CDA1', edgecolor='black', label='Developing')

plt.title('Lung cancer prevalence rate (Percentage of population diagnosed with lung cancer) \n in Developed vs Developing Countries')
plt.xlabel('Lung cancer prevalence rate')
plt.ylabel('Cases')
plt.legend(bbox_to_anchor=(1.05, 1))
plt.tight_layout()
plt.show()


In [None]:
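# Pairplot country dataset
# pairplot is figure-level and creates its own figure, so a preceding plt.figure() would only add an empty one
sns.set(style="white")
g = sns.pairplot(country, hue='Country', palette='husl', corner=True)

# Reposition the legend that pairplot attaches to the figure
sns.move_legend(g, 'upper right', bbox_to_anchor=(0.6, 1), fontsize=18)
plt.show()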

In [124]:
# Mortality rate vs. lung cancer prevalence rate by country
plt.figure(figsize=(6, 5))
sns.scatterplot(x='Mortality_Rate', y='Lung_Cancer_Prevalence_Rate', data = country,
                style='Country', hue='Country', palette='husl', s=100)

plt.title('Mortality Rate vs. Lung Cancer Prevalence Rate by Country')
plt.xlabel('Mortality_Rate')
plt.ylabel('Lung_Cancer_Prevalence_Rate')
plt.legend(title='Country', bbox_to_anchor=(1.15, 1))
plt.annotate('Nigeria', xy=(3.215, 1.511), xytext=(3.27, 1.511),arrowprops=dict(facecolor='red', arrowstyle='->'),  
    fontsize=8, color='black',
    horizontalalignment='center',
    verticalalignment='bottom')

plt.annotate('Ethiopia', xy=(3.426, 1.5035), xytext=(3.426, 1.505),arrowprops=dict(facecolor='red', arrowstyle='->'),  
    fontsize=8, color='black',
    horizontalalignment='center',
    verticalalignment='bottom')

plt.show()

In [48]:
# Distribution of average mortality rate across countries

plt.figure(figsize=(8, 4))

sns.histplot(country['Mortality_Rate'], bins=20, kde=True, color='#107082', edgecolor='black')
plt.title('Distribution of Average Mortality Rate Across Countries')
plt.xlabel('Mortality Rate')
plt.ylabel('Number of Countries')

plt.show()


In [50]:
# Top countries of Average mortality rate

top_mortality = country.sort_values('Mortality_Rate', ascending=False).head(10)

plt.figure(figsize=(9, 4))
colours = plt.cm.viridis(np.linspace(0, 1, 10))
sns.barplot(data=top_mortality, x='Mortality_Rate', y='Country', palette = colours)

plt.title('Top 10 Countries by Average Mortality Rate')
plt.xlabel('Average Mortality Rate')
plt.ylabel('Country')

plt.show()


In [105]:
# Lung cancer cases in Ethiopia: prevalence rate vs. years of smoking

ethiopia_data = lung_diagnosis[lung_diagnosis['Country'] == 'Ethiopia']

# lmplot is figure-level and creates its own figure, so size it with `height` instead of plt.figure
sns.lmplot(data=ethiopia_data, x='Lung_Cancer_Prevalence_Rate', y='Years_of_Smoking',
           hue='Adenocarcinoma_Type', height=6)

plt.title('Lung cancer prevalence rate vs. Years of smoking \n in Adenocarcinoma type group in Ethiopia')
plt.xlabel('Lung cancer prevalence rate')
plt.ylabel('Years of Smoking')

plt.show()

In [103]:
# Lung cancer cases in Ethiopia: survival years vs. years of smoking

ethiopia_data = lung_diagnosis[lung_diagnosis['Country'] == 'Ethiopia']

# lmplot is figure-level and creates its own figure, so size it with `height` instead of plt.figure
sns.lmplot(data=ethiopia_data, x='Survival_Years', y='Years_of_Smoking',
           hue='Cancer_Stage', height=6)

plt.title('Survival years vs. Years of smoking \n by cancer stage in Ethiopia')
plt.xlabel('Survival years')
plt.ylabel('Years of Smoking')

plt.show()

In [129]:
# Lung cancer cases in Nigeria: prevalence rate vs. years of smoking

nigeria_data = lung_diagnosis[lung_diagnosis['Country'] == 'Nigeria']

palette = sns.color_palette("viridis", 2)

# lmplot is figure-level and creates its own figure, so size it with `height` instead of plt.figure
sns.lmplot(data=nigeria_data, x='Lung_Cancer_Prevalence_Rate', y='Years_of_Smoking',
           hue='Adenocarcinoma_Type', palette=palette, height=6)

plt.title('Lung cancer prevalence rate vs. Years of smoking \n in Adenocarcinoma type group in Nigeria')
plt.xlabel('Lung cancer prevalence rate')
plt.ylabel('Years of Smoking')

plt.show()

In [131]:
# Lung cancer cases in Nigeria: survival years vs. years of smoking

nigeria_data = lung_diagnosis[lung_diagnosis['Country'] == 'Nigeria']

palette = sns.color_palette("viridis", 4)

# lmplot is figure-level and creates its own figure, so size it with `height` instead of plt.figure
sns.lmplot(data=nigeria_data, x='Survival_Years', y='Years_of_Smoking',
           hue='Cancer_Stage', palette=palette, height=6)

plt.title('Survival years vs. Years of smoking \n by cancer stage in Nigeria')
plt.xlabel('Survival years')
plt.ylabel('Years of Smoking')

plt.show()