### Step 1: Import Required Libraries
Start by importing the necessary Python libraries:

In [1]:
import pandas as pd
import numpy as np

### Step 2: Locate an Open Source Dataset
You can choose a dataset from a platform such as Kaggle, the UCI Machine Learning Repository, or a public open data portal. Popular choices include the Titanic, Iris, and Wine Quality datasets. This example uses the Titanic dataset from Kaggle.

**Dataset Link:** [Titanic Dataset on Kaggle](https://www.kaggle.com/c/titanic/data)

### Step 3: Describe the Dataset
The Titanic dataset contains data about passengers who were on the Titanic, including details such as age, gender, ticket class, fare, etc. The goal is often to predict survival based on these features.

### Step 4: Load the Dataset
Use pandas to load the dataset into a DataFrame.

In [4]:
# Assuming the CSV file is named 'train.csv'
df = pd.read_csv('F:/myPortfolio/Programming/Masters in Science - Computer Science/20241105-Practicals-Big_Data_Analysis/titanic/train.csv')
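
The absolute path above only works on the original author's machine. As a minimal sketch of a more portable variant, the helper below resolves a relative path and fails early with a clear message if the file is missing. The function name `load_csv` and the `titanic/train.csv` folder layout are assumptions made for illustration, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Load a CSV into a DataFrame, raising a clear error if it is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found: {csv_path.resolve()}")
    return pd.read_csv(csv_path)


# Hypothetical layout: train.csv sits in a 'titanic' folder next to the notebook.
DATA_PATH = Path("titanic") / "train.csv"
if DATA_PATH.is_file():
    df = load_csv(DATA_PATH)
    print(df.head())
```

Keeping the path relative to the project folder means the notebook can be shared or moved without editing the load step.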

### Step 5: Initial Data Preprocessing
To understand the data better, count the missing values in each column and review summary statistics for the numeric columns.

In [5]:
# Check for missing values
print(df.isnull().sum())

# Get initial statistics
print(df.describe())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  


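The first block lists the number of missing values per column: `Age` (177), `Cabin` (687), and `Embarked` (2) have gaps. `describe()` then reports the count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum for each numeric column. You can also use `df.info()` to get an overview of the data types and non-null counts.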
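
The missing-value counts above suggest a cleanup step before modelling. The sketch below shows one possible approach on a tiny made-up sample that reuses the Titanic column names: fill `Age` with the median, fill `Embarked` with the most frequent value, and drop the mostly empty `Cabin` column. These choices are assumptions for illustration; other strategies may suit your analysis better.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; only the column names mirror the Titanic data.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Share of missing values per column, as a percentage.
missing_pct = sample.isnull().mean() * 100
print(missing_pct)

cleaned = sample.copy()
# Numeric column: the median is less sensitive to outliers than the mean.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
# Categorical column: fill with the most frequent value (the mode).
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])
# Column that is mostly empty: drop it rather than impute.
cleaned = cleaned.drop(columns=["Cabin"])

print(cleaned.isnull().sum())
```

The same three calls can be applied to `df` once you have decided which strategy fits each column.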

**Variable Descriptions:**
- `Survived`: Survival (0 = No, 1 = Yes)
- `Pclass`: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
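- `PassengerId`: Unique identifier for each passenger
- `Name`, `Sex`, `Age`: Passenger name, sex, and age in years
- `SibSp`: Number of siblings or spouses aboard
- `Parch`: Number of parents or children aboard
- `Ticket`: Ticket number
- `Fare`: Passenger fare
- `Cabin`: Cabin number
- `Embarked`: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)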



### Step 6: Check Data Dimensions

In [6]:
# Dimensions of the DataFrame
print(df.shape)

(891, 12)


### Step 7: Data Formatting and Normalization
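Check the data types and convert columns where necessary. Because `print(df.dtypes)` runs before the conversion, the output below shows the original types: `Pclass` and `Survived` still appear as `int64`.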

In [7]:
# Check data types
print(df.dtypes)

# Convert categorical variables to appropriate types, if necessary
df['Pclass'] = df['Pclass'].astype('category')
df['Survived'] = df['Survived'].astype('category')

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
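
The heading mentions normalization, but the cell above only changes data types. One common option is min-max scaling, which rescales a numeric column to the range [0, 1]. The sketch below applies it to made-up `Age` and `Fare` values; the helper name `min_max_scale` and the `_scaled` column suffix are hypothetical choices, not part of the original notebook.

```python
import pandas as pd


def min_max_scale(series):
    """Rescale a numeric Series to [0, 1]; missing values stay missing."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        # Constant column: map every non-missing value to 0.0.
        return series * 0.0
    return (series - lo) / (hi - lo)


# Made-up values for illustration; only the column names mirror the Titanic data.
sample = pd.DataFrame({
    "Age": [0.42, 28.0, 80.0],
    "Fare": [0.0, 14.5, 100.0],
})

for col in ["Age", "Fare"]:
    sample[col + "_scaled"] = min_max_scale(sample[col])

print(sample)
```

Writing the result to new columns keeps the original values available for inspection.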


### Step 8: Convert Categorical Variables into Quantitative Variables
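Convert categorical variables (e.g., `Sex` and `Embarked`) into numerical values, either by mapping each category to a number or by one-hot encoding with pandas `get_dummies()`. With `drop_first=True`, the first category is dropped and serves as the baseline.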

In [8]:
# Convert 'Sex' column into numeric values
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# Convert 'Embarked' column into dummy variables
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
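
One detail to watch: `Embarked` has 2 missing values, and by default `get_dummies()` encodes a missing value as zeros in every dummy column. With `drop_first=True`, that is the same pattern as the dropped first category. The sketch below illustrates this on made-up rows and shows the `dummy_na=True` option as one possible way to keep missing values distinguishable; filling the gaps before encoding is another.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column name and port codes mirror the Titanic data.
sample = pd.DataFrame({"Embarked": ["C", "Q", "S", np.nan]})

# Default behaviour: the NaN row gets zeros everywhere, and so does the
# dropped first category ("C"), so the two rows look identical.
plain = pd.get_dummies(sample, columns=["Embarked"], drop_first=True, dtype=int)
print(plain)

# dummy_na=True adds a separate indicator column for missing values.
with_na = pd.get_dummies(
    sample, columns=["Embarked"], drop_first=True, dummy_na=True, dtype=int
)
print(with_na)
```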


These steps provide a clear pathway for loading, preprocessing, and formatting the dataset. You can adjust these commands as necessary depending on the dataset you choose.