# 1. Project Title: [Data Classification]
___

#### a. Introduction

- **Objective:** Clearly state the goal of your project. What problem are you trying to solve?
- **Background:** Provide context on why this problem is important or interesting. Mention any relevant research, datasets, or industry relevance.
- **Scope:** Define the boundaries of your project. What will be included, and what will be out of scope?

#### b. Project Overview

- **Project Summary:** A brief overview of the project, including the main steps you will take to achieve the objective.
- **Milestones:** Outline the key milestones or phases of the project. For example:
  - Data Collection
  - Data Preprocessing
  - Model Selection
  - Model Training and Evaluation
  - Results and Conclusion


#### c. About the Author

- **Name:** [Ahmed Ferganey]
- **Background:** Junior Data Scientist and Machine Learning Engineer with a strong foundation in embedded systems, industrial engineering, and supply chain management. Knowledgeable in statistical analysis, NLP, Computer Vision, and deep learning, with hands-on experience in Python, SQL, and Docker.
- **Motivation:** Why are you interested in this project? What do you hope to learn or achieve?
- **Contact:** [LinkedIn profile](https://www.linkedin.com/in/ahmed-ferganey/)



#### d. Tools and Technologies

- **Programming Languages:** List the programming languages you will use (e.g., Python).
- **Libraries and Frameworks:** List the specific libraries and frameworks you will use (e.g., TensorFlow, scikit-learn).
- **Software and Tools:** Mention any software or tools necessary for the project (e.g., Jupyter Notebook, Git).

#### e. Dataset Description

- **Dataset Name:** [Name of the Dataset]
- **Source:** Where did you obtain the dataset? Include a link if possible.
- **Description:** Briefly describe the dataset, including the number of features, the target variable, and any other important details.
- **Data Preprocessing:** Outline any preprocessing steps you anticipate, such as data cleaning, normalization, or feature engineering.

#### f. Methodology

- **Model Selection:** Describe the types of models you are considering and why.
- **Evaluation Metrics:** Define how you will evaluate your models' performance (e.g., accuracy, F1-score).
- **Validation Strategy:** Explain how you will validate your models, such as cross-validation or a 
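
As a minimal sketch of how the evaluation and validation bullets above could fit together, the cell below scores one classifier with stratified 5-fold cross-validation and a macro-averaged F1-score. The synthetic data from `make_classification`, the choice of `LogisticRegression`, and the fold count are assumptions made only for illustration; they are not the project's dataset or chosen model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption): 200 samples, 10 features, 3 classes
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# Macro F1 averages the per-class F1-scores, so every class counts equally
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
print(f"F1 (macro) per fold: {scores}")
print(f"Mean F1 (macro): {scores.mean():.3f}")
```

Macro averaging is one reasonable default when class counts differ; plain accuracy can be obtained by passing `scoring="accuracy"` instead.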


### 2. importing libraries
___



In [1]:
import os 
import io
import sys
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="whitegrid")
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectPercentile, SelectKBest, f_classif, chi2
import plotly.express as px
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.naive_bayes import GaussianNB,BernoulliNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis,QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from tqdm import tqdm

### 3. reading the raw data
___

In [4]:
# Path to the main directory
path_ = r'/media/ahmed-ferganey/AI4/01-Learning_AI/MyGitHub/Machine_Learning_Projects/CSV_Files/data_new'

# Paths for saving files (both live inside the main directory)
merged_output_path = os.path.join(path_, 'merged_output.csv')
headered_output_path = os.path.join(path_, 'DataReading.csv')

# Lists to store data and file details
merged_data = []
all_joint_positions_files = {}
all_labels_files = {}
deleted_files = []
subfolder_number = {}

In [None]:
# Traverse the main directory
for folder in tqdm(os.listdir(path_)):
    subfolder_path = os.path.join(path_, folder)
    
    # Check if it's a directory
    if os.path.isdir(subfolder_path):
        # Count and store the number of items in this subfolder
        subfolder_number[folder] = len(os.listdir(subfolder_path))
        
        # Traverse each subdirectory in this subfolder
        for subfolder in os.listdir(subfolder_path):
            final_folder_path = os.path.join(subfolder_path, subfolder)
            
            # Check if it's a directory
            if os.path.isdir(final_folder_path):
                joint_positions_exists = False
                labels_exists = False
                joint_positions_file = None
                labels_file = None
                
                for file in os.listdir(final_folder_path):
                    full_file_path = os.path.join(final_folder_path, file)
                    
                    if file == 'Joint_Positions.csv':
                        joint_positions_file = full_file_path
                        joint_positions_exists = True
                    elif file == 'Labels.csv':
                        labels_file = full_file_path
                        labels_exists = True
                    else:
                        # Delete any file that is not 'Joint_Positions.csv' or 'Labels.csv'
                        os.remove(full_file_path)
                        deleted_files.append(full_file_path)
                        print(f"Deleted: {full_file_path}")
                
                # Print an error message if one of the files is missing
                if not joint_positions_exists:
                    print(f"Error: The path '{final_folder_path}' does not include the file 'Joint_Positions.csv'")
                if not labels_exists:
                    print(f"Error: The path '{final_folder_path}' does not include the file 'Labels.csv'")
                
                # Calculate rows and ratios if both files are present
                if joint_positions_exists and labels_exists:
                    try:
                        joint_df = pd.read_csv(joint_positions_file, header=None)
                        labels_df = pd.read_csv(labels_file, header=None)
                        joint_rows = len(joint_df)
                        labels_rows = len(labels_df)
                        
                        ratio = joint_rows / labels_rows if labels_rows > 0 else None
                        all_joint_positions_files[joint_positions_file] = joint_rows
                        all_labels_files[labels_file] = labels_rows
                        
                        print(f"File Pair: {joint_positions_file} and {labels_file}")
                        print(f"Joint_Positions.csv rows: {joint_rows}")
                        print(f"Labels.csv rows: {labels_rows}")
                        print(f"Ratio (Joint_Positions / Labels): {ratio:.2f}" if ratio is not None else "Error: Division by zero")
                        
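                        # Build one merged row per label from its block of 25 joint rows
                        for i in range(len(labels_df)):
                            output_row = labels_df.iloc[i].values
                            input_rows = joint_df.iloc[i * 25:(i + 1) * 25]
                            
                            # Skip labels without a full block of 25 joint rows; a short
                            # row would not match the 78-column header assigned later
                            if len(input_rows) < 25:
                                print(f"Warning: incomplete frame {i} in '{final_folder_path}', skipped")
                                continue
                            input_data = input_rows.values.flatten()
                            
                            # Label first, then the 75 joint values (25 x 3), then folder names
                            merged_row = list(output_row) + list(input_data) + [folder, subfolder]
                            merged_data.append(merged_row)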
                        
                    except Exception as e:
                        print(f"Error reading files in directory '{final_folder_path}': {e}")
        
        print('-----------------------------------\n')
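
The per-label slicing in the loop above can also be written as a single NumPy reshape. The sketch below is an illustration on toy data, not a replacement for the cell: `merge_frames` is a hypothetical helper, and it assumes each label owns 25 consecutive joint rows with 3 coordinates each, and that labels without a complete block are dropped.

```python
import numpy as np
import pandas as pd

JOINTS_PER_FRAME = 25  # joint rows consumed per label, as in the loop above


def merge_frames(joint_df, labels_df, joints_per_frame=JOINTS_PER_FRAME):
    """Return one row per label: label values followed by flattened joint coordinates."""
    # Keep only labels that have a complete block of joint rows (assumption)
    n_frames = min(len(labels_df), len(joint_df) // joints_per_frame)
    joints = joint_df.to_numpy()[: n_frames * joints_per_frame]
    # Row-major reshape lays each block out as x1, y1, z1, x2, y2, z2, ...
    flat = joints.reshape(n_frames, joints_per_frame * joints.shape[1])
    labels = labels_df.to_numpy()[:n_frames]
    return np.hstack([labels, flat])


# Toy data: 2 labels and 50 joint rows with 3 coordinates each
toy_joints = pd.DataFrame(np.arange(150, dtype=float).reshape(50, 3))
toy_labels = pd.DataFrame([1.0, 2.0])

merged = merge_frames(toy_joints, toy_labels)
print(merged.shape)  # (2, 76): 1 label column + 75 coordinate columns
```

The row-major reshape produces the same column order as `input_rows.values.flatten()`, so the `Sample_{i}_X/Y/Z` header built later would still line up.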

In [None]:
# Print summary
print("\nSummary:")
print(f"Total Labels.csv files retained: {len(all_labels_files)}")
print(f"Total Joint_Positions.csv files retained: {len(all_joint_positions_files)}")
print(f"Total files deleted: {len(deleted_files)}")

In [None]:
# Create DataFrame and save to CSV
try:
    # Define headers
    header = ['OUTPUT']  # First column header

    # Add Sample headers
    for i in range(1, 26):  # For 25 samples
        header.extend([f'Sample_{i}_X', f'Sample_{i}_Y', f'Sample_{i}_Z'])
    
    # Add new columns
    header.extend(['Main_Folder', 'Sub_Folder'])
    
    # Create DataFrame and assign header
    merged_df = pd.DataFrame(merged_data)
    merged_df.columns = header
    
    # Save the DataFrame with header to a new CSV file
    merged_df.to_csv(headered_output_path, index=False)
    print(f"Merged file with header saved to: {headered_output_path}")
except Exception as e:
    print(f"Error saving merged file: {e}")

In [None]:
DataReadingFrame = pd.read_csv(headered_output_path)
DataReadingFrame

In [None]:
DataReadingFrame['OUTPUT'].unique()

In [None]:
DataReadingFrame['OUTPUT'].value_counts()

In [None]:
DataReadingFrame['Main_Folder'].value_counts()

In [None]:
DataReadingFrame['Sub_Folder'].value_counts()

### 4. data analysis
___

### 5. data cleaning
___

##### 5.1 finding nulls
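
One possible way to look for nulls is sketched below on a small toy frame. The frame `toy` is hypothetical and only borrows column names from the merged header; on the real data the same calls would be applied to `DataReadingFrame`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for DataReadingFrame
toy = pd.DataFrame({
    "OUTPUT": [1, 2, 1],
    "Sample_1_X": [0.1, np.nan, 0.3],
    "Sample_1_Y": [0.5, 0.6, 0.7],
})

# Missing values per column
null_counts = toy.isnull().sum()

# Keep only the columns that actually contain nulls
null_columns = null_counts[null_counts > 0]

# Rows with at least one missing value
rows_with_nulls = toy[toy.isnull().any(axis=1)]

print(null_columns)
print(f"Rows with nulls: {len(rows_with_nulls)}")
```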

##### 5.2 outliers

##### 5.3 feature extraction

##### 5.4 feature selection

### 6. visualization
___

### 7. building the model
___

### 8. evaluating the model
___