## Karatu Session Exam Project
#### Instructions:
You will be provided with a dataset in CSV format.
Your tasks are to write Python functions that accomplish the following:
1. Read the dataset into a pandas DataFrame.
   *Extra marks if you handle exceptions gracefully and print meaningful error messages.*
2. Sum the missing values for each column.
   *Ensure the function returns a pandas Series that lists the number of missing values for each column.*
3. Check for duplicates in each column.
   *Ensure the function returns a dictionary with column names and boolean values indicating the presence of duplicates.*


#### Reading dataset into a Pandas DataFrame

In [3]:
# Import pandas library for data manipulation and analysis
import pandas as pd

# Create function to read dataset into pandas dataframe
def read_dataset(file_name: str):
    """
    Read a dataset in comma-separated values (csv) format into a dataframe.

    Parameters:
    file_name: path or URL of the csv file
    file_name type: str

    Returns:
    DataFrame: The csv file contents as a two-dimensional data structure,
    or None if the file could not be read.
    """
    # Try and except statement to handle read errors
    try:
        # Attempt to read the dataset into a pandas dataframe
        project_df = pd.read_csv(file_name)
        return project_df
    # Except clauses print a specific error message should the read fail
    except FileNotFoundError:
        print(f"Error message: File '{file_name}' not found")
    except Exception as e:
        # f-string formatting includes the full detail of the error
        print(f"Error message: An error occurred while reading the dataset: {e}")
    # Reaching this point means the read failed
    return None

# Confirm dataframe creation using conditional statement
file_name = "titanic.csv"
ExamProject_df = read_dataset(file_name) 
if ExamProject_df is not None:      
    print("""Karatu first semester exam project dataset read successfully into a pandas dataframe\n
Data cleaning, entailing analysis of missing and duplicate values according to assessment instruction will subsequently be carried out""")
else:
    print("troubleshoot 'read_dataset' function")
    

Karatu first semester exam project dataset read successfully into a pandas dataframe

Data cleaning, entailing analysis of missing and duplicate values according to assessment instruction will subsequently be carried out
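
The exception handling in `read_dataset` can be exercised without the Titanic file. The block below is a minimal sketch, not part of the original submission: `read_dataset_strict` is a hypothetical variant that also catches `pd.errors.EmptyDataError` and `pd.errors.ParserError`, and the in-memory sample and the file name `no_such_file_hypothetical.csv` are made up for illustration.

```python
import io

import pandas as pd


def read_dataset_strict(source):
    """Hypothetical variant of read_dataset: returns a DataFrame, or None on failure."""
    try:
        return pd.read_csv(source)
    except FileNotFoundError:
        print(f"Error message: File '{source}' not found")
    except pd.errors.EmptyDataError:
        print("Error message: The file contains no data")
    except pd.errors.ParserError as e:
        print(f"Error message: The file could not be parsed as csv: {e}")
    return None


# Small in-memory csv standing in for a real file (illustrative data only)
sample = io.StringIO("PassengerId,Age\n1,22\n2,\n")
demo_df = read_dataset_strict(sample)

# A path assumed not to exist triggers the FileNotFoundError branch
missing_file = read_dataset_strict("no_such_file_hypothetical.csv")

# An empty buffer triggers the EmptyDataError branch
empty_file = read_dataset_strict(io.StringIO(""))
```

Returning `None` on every failure path keeps the `if ExamProject_df is not None` check used above meaningful.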


In [4]:
# Call function to read dataset into dataframe and assign the output to a variable
ExamProject_df = read_dataset("titanic.csv")

In [5]:
# View dataset in pandas dataframe
ExamProject_df.head(10)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [6]:
# Data overview
print("Shape of the exam project (titanic dataset) dataframe", ExamProject_df.shape)

Shape of the exam project (titanic dataset) dataframe (891, 12)


#### Function to sum missing values for each column

In [8]:
# Create function to sum missing values in each column of dataframe

def sum_missing_values(dataframe):
    """
    Calculate and return the number of missing values per column in a dataframe.

    Parameters:
    dataframe: The input dataframe.
    dataframe type: pandas DataFrame

    Returns:
    A series containing the count of missing values per column.

    rtype: Series
    """
    # Sum the missing values in each column and produce a pandas series
    missing_values_sum = dataframe.isnull().sum()
    return missing_values_sum

# Call function to produce sum of missing value as a pandas series
missing_values_series = sum_missing_values(ExamProject_df)
print(f"Sum of missing values in exam project dataframe as a pandas series: \n{missing_values_series}")  # f-string formatting


Sum of missing values in exam project dataframe as a pandas series: 
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
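
Raw counts are easier to judge against the number of rows. With 891 rows, the 177 missing `Age` values are roughly 19.9% of the column and the 687 missing `Cabin` values roughly 77.1%. The block below is a minimal sketch of one way to compute such shares; `missing_value_share` and `toy_df` are hypothetical names and data introduced for illustration, not part of the assessment.

```python
import pandas as pd


def missing_value_share(dataframe):
    """Return the percentage of missing values per column as a pandas Series."""
    # isnull() gives booleans; their mean is the fraction of missing entries
    return (dataframe.isnull().mean() * 100).round(2)


# Illustrative toy data: half of the Age values are missing
toy_df = pd.DataFrame(
    {
        "Age": [22.0, None, 26.0, None],
        "Sex": ["male", "female", "female", "male"],
    }
)

share = missing_value_share(toy_df)
print(share)
```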


#### Function to check for duplicates in each column

In [10]:
# Create function to check for duplicates in dataframe

def check_for_duplicates(dataframe):
    """
    Report, for each column in a dataframe, whether it contains duplicate entries.

    Parameters:
    dataframe: The input dataframe.
    dataframe type: pandas DataFrame

    Returns:
    A dictionary mapping each column name to True if the column contains
    duplicate values, otherwise False.

    rtype: dict
    """
    # Create an empty dictionary to store column names and boolean values
    duplicates_dict = {}
    
    # Iterate through each column in the dataframe
    for col in dataframe.columns:
        # Check for duplicates in each column and store the result as a plain Python boolean in the dictionary
        has_duplicates = bool(dataframe[col].duplicated().any())
        duplicates_dict[col] = has_duplicates
    
    return duplicates_dict

# Calling function to identify duplicates within dataframe
duplicates_result = check_for_duplicates(ExamProject_df)
print(f"The duplicates in the dataframe are stored in the dictionary: \n{duplicates_result}")  # using f-string formatting


The duplicates in the dataframe are stored in the dictionary: 
{'PassengerId': False, 'Survived': True, 'Pclass': True, 'Name': False, 'Sex': True, 'Age': True, 'SibSp': True, 'Parch': True, 'Ticket': True, 'Fare': True, 'Cabin': True, 'Embarked': True}


In [11]:
# Assign the duplicates dictionary to a variable
# and view each column name with its boolean value
duplicate_values_dict = duplicates_result
duplicate_values_dict


{'PassengerId': False,
 'Survived': True,
 'Pclass': True,
 'Name': False,
 'Sex': True,
 'Age': True,
 'SibSp': True,
 'Parch': True,
 'Ticket': True,
 'Fare': True,
 'Cabin': True,
 'Embarked': True}
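
One caveat when reading this result: `Series.duplicated()` treats repeated missing values as duplicates of each other, so a column with several `NaN` entries reports `True` even if its non-missing values are all unique. The block below is a minimal sketch of a variant that drops missing values first; `check_for_duplicates_ignoring_missing` and `toy_df` are hypothetical names and data used only for illustration.

```python
import pandas as pd


def check_for_duplicates_ignoring_missing(dataframe):
    """Map each column name to True if its non-missing values contain duplicates."""
    return {
        col: bool(dataframe[col].dropna().duplicated().any())
        for col in dataframe.columns
    }


# Illustrative toy data: Cabin has two missing values but no repeated cabin
toy_df = pd.DataFrame(
    {
        "Cabin": ["C85", None, None],
        "Sex": ["male", "female", "male"],
    }
)

# Same check as check_for_duplicates above, where missing values count as duplicates
with_missing = {col: bool(toy_df[col].duplicated().any()) for col in toy_df.columns}

# Variant that ignores missing values
without_missing = check_for_duplicates_ignoring_missing(toy_df)

print(with_missing)
print(without_missing)
```

Which behaviour is wanted depends on whether missing entries should count as repeated values for the assessment.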

In [None]:
# Use a user input function to end the exam project (run it and try it out)
def end_project():
    user_input = input("Does my project align with the assessment objectives outlined? (yes/no): ")
    if user_input.lower() == "yes":
        print("Thank you for taking the time to review and commend my project!")
    elif user_input.lower() == "no":
        print("Thank you for reading through. Please do kindly comment with your observations!")
    else:
        print("Kindly input a valid response (yes/no)")

# Call function to prompt User for input
end_project()
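
As written, `end_project` asks for a valid response but does not prompt again after an invalid one. The block below is a minimal sketch of a re-prompting variant. `end_project_with_retry`, its `ask` parameter, and the `max_attempts` limit are assumptions introduced here so the function can be tried with scripted replies instead of keyboard input.

```python
def end_project_with_retry(ask=input, max_attempts=3):
    """Prompt until the reply is 'yes' or 'no'; return it, or None if attempts run out."""
    prompt = "Does my project align with the assessment objectives outlined? (yes/no): "
    for _ in range(max_attempts):
        # Normalise the reply so ' Yes ' and 'YES' are both accepted
        answer = ask(prompt).strip().lower()
        if answer in ("yes", "no"):
            return answer
        print("Kindly input a valid response (yes/no)")
    return None


# Scripted demo: canned replies stand in for keyboard input
replies = iter(["maybe", "Yes"])
result = end_project_with_retry(ask=lambda _prompt: next(replies))
print(result)
```

Calling `end_project_with_retry()` with no arguments falls back to the built-in `input`, as in the cell above.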