# Data Ingestion
This module handles the downloading, extraction, and loading of data files for the NFL Game Competition project.
## Overview 
The `data_ingestion.py` script is responsible for:
1. Downloading a zipped dataset from a specified URL.   
2. Unzipping the downloaded file to a designated directory.
3. Loading the data from the unzipped files into pandas DataFrames.
## Key Components
- **DataIngestionConfig**: A configuration class that holds parameters for data ingestion, such as source URL, local file paths, and directories.
- **DataIngestionArtifact**: A class that encapsulates the paths to the ingested data files.
- **DataIngestion**: The main class that implements methods for downloading, unzipping, and loading data.
## Usage
1. **Initialization**: Create a `TrainingPipelineConfig`, derive a `DataIngestionConfig` from it, and pass that config to `DataIngestion`.
    ```python
    training_pipeline_config = TrainingPipelineConfig()
    config = DataIngestionConfig(training_pipeline_config=training_pipeline_config)
    data_ingestion = DataIngestion(data_ingestion_config=config)
    ```
2. **Data Ingestion**: Call the `initiate_data_ingestion` method to run the download and extraction steps. It currently returns the extraction directory.
    ```python
    unzip_dir = data_ingestion.initiate_data_ingestion()
    print(unzip_dir)
    ```
## Integration Steps    
To integrate the data ingestion module into your project, follow these steps:
**1. Define the configuration parameters in `config.yaml` and create a corresponding `DataIngestionConfig` class in `src/nfl_game_competition/entity/config_entity.py`.**
**2. Implement the `DataIngestion` class in `src/nfl_game_competition/components/data_ingestion.py` as shown in step 3 below.**
**3. Create an `Artifact` class in `src/nfl_game_competition/entity/artifact_entity.py` to encapsulate the output of the data ingestion process.**
**4. Update the main pipeline in `src/nfl_game_competition/pipelines/` to include the data ingestion step.**
**5. Ensure that logging and exception handling are properly implemented for robustness.**
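
Step 5 can be sketched as follows. The `NFLGameCompetitionException(e, sys)` call pattern used later in this notebook suggests an exception that reads traceback details through `sys.exc_info()`, but the project's actual implementation is not shown here. The class name `PipelineError`, its attributes, and the message format below are therefore hypothetical.

```python
import sys


class PipelineError(Exception):
    """Hypothetical wrapper that records where the original error occurred."""

    def __init__(self, error: Exception, sys_module=sys):
        # exc_info() returns (type, value, traceback) for the exception being handled
        _, _, tb = sys_module.exc_info()
        if tb is not None:
            self.file_name = tb.tb_frame.f_code.co_filename
            self.line_number = tb.tb_lineno
        else:
            self.file_name, self.line_number = "<unknown>", -1
        self.message = str(error)
        super().__init__(
            f"Error in [{self.file_name}] at line [{self.line_number}]: {self.message}"
        )


def risky_division(a: float, b: float) -> float:
    try:
        return a / b
    except Exception as e:
        # 'from e' keeps the original exception chained as __cause__
        raise PipelineError(e, sys) from e
```

Raising with `from e` preserves the original traceback in the chained exception, which tends to make pipeline failures easier to diagnose from logs.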
## Next Steps
After successfully ingesting the data, the next steps typically involve:
- Data validation and schema enforcement.
- Data preprocessing and feature engineering.
- Model training and evaluation.
## Note
This module is designed to be modular and easily integrated into larger data processing pipelines. Adjust the paths and parameters as needed to fit your project's structure and requirements.  
## How to Use This Template
This template provides a structured approach to setting up a data science project. Follow these steps to effectively utilize the template:
**1. Clone the repository and set up your environment.**
- Install necessary dependencies listed in `requirements.txt`.

In [1]:
import os

In [2]:
# check current working directory 
%pwd

'd:\\kaggle\\nfl_game_competition\\notebooks'

In [3]:
# go back to the project root directory: nfl_game_competition
os.chdir("../")
# now check again
%pwd

'd:\\kaggle\\nfl_game_competition'
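
Because `os.chdir("../")` moves up one level every time the cell is re-run, a marker-based lookup can be a more repeatable way to reach the project root. The sketch below is one possible approach. The helper name `find_project_root` is hypothetical, and using `requirements.txt` as the marker file is an assumption about the repository layout.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "requirements.txt") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} above {start}")
```

In a notebook this could be used as `os.chdir(find_project_root(Path.cwd()))`, which gives the same result no matter how many times the cell is executed.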

### step 1: `src/nfl_game_competition/constants/__init__.py` 

In [4]:
from pathlib import Path

CONFIG_FILE_PATH=Path("config/config.yaml")
PARAMS_FILE_PATH= Path("params.yaml")
SCHEMA_FILE_PATH=Path("schema.yaml")

### step 1.2: `src/nfl_game_competition/constants/training_pipeline_constants/__init__.py`

In [5]:
"""
Common constants for the training pipeline.
"""
import os

ARTIFACT_DIR: str = 'artifacts'
PIPELINE_NAME: str = 'nfl_game_competition'

TRAIN_FILE_NAME: str= "train.csv"
TEST_FILE_NAME: str = "test.csv"

SCHEMA_FILEPATH: str = os.path.join('data_schema', 'schema.yaml')

SAVED_MODEL_DIR: str = os.path.join("models", "saved_models")
FILE_NAME: str = "merged_data.csv"
"""Data Ingestion Step 1: 
This file is 'src/nfl_game_competition/constants/training_pipeline_constants/__init__.py'.
Step 2 creates the DataIngestionConfig class in 'src/nfl_game_competition/entity/config_entity.py'.

Data ingestion constants start with the DATA_INGESTION prefix.
"""
DATA_INGESTION_COLLECTION_NAME: str = "NFLGameCompetitionData"
DATA_INGESTION_SOURCE_URL: str = "nfl-big-data-bowl-2026-prediction"
DATA_INGESTION_DIR_NAME: str = "data_ingestion"
DATA_INGESTION_FEATURE_STORE_DIR: str = "feature_store"
DATA_INGESTION_INGESTED_DIR: str = "ingested"
DATA_INGESTION_ZIPPED_FILE_DIR: str = "zipped_data"
DATA_INGESTION_UNZIPPED_DIR_NAME: str = "unzipped_data"
DATA_INGESTION_SPLITTED_DIR: str = "train_test_split"

### step 2: `src/nfl_game_competition/entity/config_entity.py`

In [None]:
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Optional
#from src.nfl_game_competition.constants import training_pipeline_constants as training_pipeline
import os

# Training pipeline configuration
class TrainingPipelineConfig:
    def __init__(self, timestamp: Optional[datetime] = None):
        # Resolve the default at call time; a datetime.now() default in the
        # signature would be evaluated only once, when the class is defined
        timestamp = timestamp or datetime.now()
        self.timestamp: str = timestamp.strftime("%d_%m_%Y_%H_%M_%S")  # formatted run timestamp
        self.pipeline_name = PIPELINE_NAME  # pipeline name
        self.artifact_name = ARTIFACT_DIR  # root artifact folder name
        self.artifact_dir = os.path.join(self.artifact_name, self.timestamp)  # per-run artifact directory
        self.model_dir = os.path.join(self.artifact_dir, "final_model")  # final model directory
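
Timestamp defaults deserve some care in a long-running notebook kernel. Python evaluates a default argument once, when the `def` statement runs, so a `datetime.now()` written in a signature is frozen at definition time. The sketch below illustrates the difference with two hypothetical helper functions (`eager_stamp` and `lazy_stamp` are not part of the project).

```python
import time
from datetime import datetime
from typing import Optional


def eager_stamp(ts: datetime = datetime.now()) -> datetime:
    # The default was computed once, when this 'def' statement ran
    return ts


def lazy_stamp(ts: Optional[datetime] = None) -> datetime:
    # The fallback is computed on every call that omits 'ts'
    return ts if ts is not None else datetime.now()


first_eager = eager_stamp()
first_lazy = lazy_stamp()
time.sleep(0.05)
second_eager = eager_stamp()
second_lazy = lazy_stamp()

print(first_eager == second_eager)  # both calls share the definition-time value
print(second_lazy > first_lazy)     # each call gets a fresh timestamp
```

With the eager form, every pipeline run started from the same kernel would write into the same timestamped artifact directory.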

In [7]:
# Plain class: @dataclass is omitted because all attributes are set in a custom __init__
class DataIngestionConfig:
    def __init__(self, training_pipeline_config: TrainingPipelineConfig):
        self.data_ingestion_dir: str = os.path.join(training_pipeline_config.artifact_dir,
                                                    DATA_INGESTION_DIR_NAME)  # data ingestion directory  
        # feature store file path, e.g. artifacts/<timestamp>/data_ingestion/feature_store/merged_data.csv
        self.feature_store_filepath: str = os.path.join(self.data_ingestion_dir, 
                                                        DATA_INGESTION_FEATURE_STORE_DIR,
                                                        FILE_NAME)  # feature store file path
        # training file path, e.g. artifacts/<timestamp>/data_ingestion/ingested/train_test_split/train.csv
        self.training_filepath: str = os.path.join(self.data_ingestion_dir, 
                                                   DATA_INGESTION_INGESTED_DIR,
                                                   DATA_INGESTION_SPLITTED_DIR,
                                                   TRAIN_FILE_NAME) # training file path 
        self.testing_filepath: str = os.path.join(self.data_ingestion_dir, 
                                                  DATA_INGESTION_INGESTED_DIR,
                                                  DATA_INGESTION_SPLITTED_DIR,
                                                  TEST_FILE_NAME)  # test filepath
        # data source: the Kaggle competition slug (not a full URL)
        self.data_source_url: str = DATA_INGESTION_SOURCE_URL  # data source url
        # create variable for collection name
        self.data_collection_name: str = DATA_INGESTION_COLLECTION_NAME  # data collection

        # directory that receives the downloaded zip file, e.g.
        # artifacts/<timestamp>/data_ingestion/ingested/zipped_data
        self.zipped_data_filepath: str = os.path.join(self.data_ingestion_dir,
                                                      DATA_INGESTION_INGESTED_DIR,
                                                      DATA_INGESTION_ZIPPED_FILE_DIR)  # zipped data file path
        
        # directory for the extracted files, e.g.
        # artifacts/<timestamp>/data_ingestion/ingested/unzipped_data
        self.unzipped_data_dir: str = os.path.join(self.data_ingestion_dir,
                                                   DATA_INGESTION_INGESTED_DIR,
                                                   DATA_INGESTION_UNZIPPED_DIR_NAME)  # unzipped data directory
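
To see the directory layout these joins produce, the sketch below rebuilds a few of the paths for a fixed timestamp. It copies the constant values from the constants cell and the format string from `TrainingPipelineConfig`. The helper `resolve_ingestion_paths` is hypothetical and exists only for illustration.

```python
import os
from datetime import datetime

# Values copied from the constants cell above
ARTIFACT_DIR = "artifacts"
DATA_INGESTION_DIR_NAME = "data_ingestion"
DATA_INGESTION_FEATURE_STORE_DIR = "feature_store"
DATA_INGESTION_INGESTED_DIR = "ingested"
DATA_INGESTION_ZIPPED_FILE_DIR = "zipped_data"
DATA_INGESTION_UNZIPPED_DIR_NAME = "unzipped_data"
DATA_INGESTION_SPLITTED_DIR = "train_test_split"
FILE_NAME = "merged_data.csv"
TRAIN_FILE_NAME = "train.csv"


def resolve_ingestion_paths(run_time: datetime) -> dict:
    """Hypothetical helper mirroring how DataIngestionConfig joins its paths."""
    timestamp = run_time.strftime("%d_%m_%Y_%H_%M_%S")
    base = os.path.join(ARTIFACT_DIR, timestamp, DATA_INGESTION_DIR_NAME)
    ingested = os.path.join(base, DATA_INGESTION_INGESTED_DIR)
    return {
        "feature_store": os.path.join(base, DATA_INGESTION_FEATURE_STORE_DIR, FILE_NAME),
        "train": os.path.join(ingested, DATA_INGESTION_SPLITTED_DIR, TRAIN_FILE_NAME),
        "zipped_dir": os.path.join(ingested, DATA_INGESTION_ZIPPED_FILE_DIR),
        "unzipped_dir": os.path.join(ingested, DATA_INGESTION_UNZIPPED_DIR_NAME),
    }


paths = resolve_ingestion_paths(datetime(2025, 10, 15, 16, 9, 13))
for name, path in paths.items():
    print(f"{name:>13}: {path}")
```

Note that `zipped_dir` is a directory rather than a file: the Kaggle download places the archive inside it.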

### step 2.1: `src/nfl_game_competition/entity/artifact_entity.py`

In [8]:
from dataclasses import dataclass


@dataclass
class DataIngestionArtifact:
    feature_store_file_path: str
    train_file_path: str
    test_file_path: str

### step 3: `src/nfl_game_competition/components/data_ingestion.py`

In [None]:
# step 3: data ingestion component
"""
The DataIngestion class provides two functions:
1) Download the data through the Kaggle API, using the competition name stored as
   data_source_url in DataIngestionConfig, and save it into zipped_data_filepath.
2) Unzip the data into unzipped_data_dir and print all file names, so that their
   patterns can be used with glob to concatenate files input-wise or output-wise.
"""
import os
import sys
from pathlib import Path
#from src.nfl_game_competition.entity.config_entity import DataIngestionConfig, TrainingPipelineConfig
from src.nfl_game_competition.exception import NFLGameCompetitionException
from src.nfl_game_competition.logger import get_logger

from kaggle.api.kaggle_api_extended import KaggleApi
from zipfile import ZipFile
import glob
# import utility function to create directories
from src.nfl_game_competition.utils.common import create_directories

logger = get_logger(name="1_data_ingestion_notebook")

# The DataIngestion class provides two functions:
# 1) download the data through the Kaggle API, using the competition name stored as
#    data_source_url in DataIngestionConfig, and save it into zipped_data_filepath
# 2) unzip the data and save it into unzipped_data_dir

class DataIngestion:
    def __init__(self, data_ingestion_config: DataIngestionConfig):
        try:
            self.data_ingestion_config = data_ingestion_config
            self.kaggle_api = KaggleApi()
            self.kaggle_api.authenticate()
        except Exception as e:
            raise NFLGameCompetitionException(e, sys) from e
        
    
    def download_nfl_data(self) -> str:
        """
        Downloads the NFL data from Kaggle using the Kaggle API and saves it to the zipped data directory.
        Returns the path to the directory that contains the downloaded zip file.

        Uses KaggleApi.competition_download_files with:
            1) competition: str - The name of the competition to download data from.
            2) path: str - The directory where the downloaded files will be saved.
        The directory is created with create_directories if it does not exist.
        """
        try:
            logger.info("Starting data download from kaggle")
            # create directory if not exist
            #print(self.data_ingestion_config.zipped_data_filepath)

            create_directories([self.data_ingestion_config.zipped_data_filepath])
            # download the data from kaggle
            self.kaggle_api.competition_download_files(competition=self.data_ingestion_config.data_source_url,
                                                       path=self.data_ingestion_config.zipped_data_filepath)
                                                       
            logger.info(f"Data downloaded successfully from kaggle and saved to {self.data_ingestion_config.zipped_data_filepath}")
            return self.data_ingestion_config.zipped_data_filepath
        except Exception as e:
            raise NFLGameCompetitionException(e, sys) from e
        
    # unzip the downloaded data into extract_dir; the function returns the extraction directory path
    def unzip_and_extract_data(self, zip_file_path: str, extract_dir: str) -> Path:
        """
        Unzips the downloaded NFL data zip file and extracts its contents to the specified directory.

        Args:
            zip_file_path: str - The directory that contains the downloaded zip file.
            extract_dir: str - The directory where the extracted files will be saved.
        The directory is created with create_directories if it does not exist.

        Returns:
            pathlib.Path: The path to the directory where the files were extracted.
        """
        try:
            logger.info(f"Starting to unzip and extract data from {zip_file_path} to {extract_dir}")
            # create directory if not exist
            create_directories([extract_dir])
            # unzip the data; the archive is named after the competition: <competition>.zip
            zip_name = f"{self.data_ingestion_config.data_source_url}.zip"
            print("zip file name: ", self.data_ingestion_config.data_source_url)
            with ZipFile(os.path.join(zip_file_path, zip_name), 'r') as zip_ref:
                zip_ref.extractall(extract_dir)
            logger.info(f"Data unzipped and extracted successfully to {extract_dir}")
            return Path(extract_dir)
        except Exception as e:
            raise NFLGameCompetitionException(e, sys) from e

    def initiate_data_ingestion(self) -> Path:
        try:
            logger.info(f"{'>>'*20} Data Ingestion {'<<'*20}")
            zip_file_path = self.download_nfl_data()
            unzip_dir = self.unzip_and_extract_data(zip_file_path, self.data_ingestion_config.unzipped_data_dir)
            logger.info(f"Data Ingestion - data downloaded and extracted to: {unzip_dir}")
            #full_df = self.load_data_multiple_files(unzip_dir)
            #train_set, test_set = self.train_test_splitting(full_df)
            #train_filepath = self.export_data_into_feature_store(train_set, "train")
            #test_filepath = self.export_data_into_feature_store(test_set, "test")
            #data_ingestion_aftifacts = DataIngestionArtifact(trained_filepath=train_filepath,
                                                             #test_filepath=test_filepath)
            #logger.info(f"Data Ingestion Completed. Artifacts: {data_ingestion_aftifacts}")
            return unzip_dir
        except Exception as e:
            raise NFLGameCompetitionException(e, sys) from e
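
The extraction step builds the archive name from the competition slug. An alternative that does not depend on the file name is to extract every `.zip` found in the download directory. The sketch below is a minimal, self-contained version of that idea. The helper `extract_all_zips` and the demo archive contents are hypothetical, and a throwaway archive is used so the example runs without Kaggle credentials.

```python
import glob
import os
import tempfile
from zipfile import ZipFile


def extract_all_zips(zip_dir: str, extract_dir: str) -> list:
    """Extract every .zip found in zip_dir and return the sorted member names."""
    os.makedirs(extract_dir, exist_ok=True)
    members = []
    for zip_path in sorted(glob.glob(os.path.join(zip_dir, "*.zip"))):
        with ZipFile(zip_path, "r") as archive:
            members.extend(archive.namelist())
            archive.extractall(extract_dir)
    return sorted(members)


# Demo with a throwaway archive (file names are made up for the example)
with tempfile.TemporaryDirectory() as workdir:
    zip_dir = os.path.join(workdir, "zipped_data")
    out_dir = os.path.join(workdir, "unzipped_data")
    os.makedirs(zip_dir)
    with ZipFile(os.path.join(zip_dir, "demo.zip"), "w") as archive:
        archive.writestr("train/input_1.csv", "a,b\n1,2\n")
        archive.writestr("test.csv", "a,b\n3,4\n")
    extracted = extract_all_zips(zip_dir, out_dir)
    extracted_ok = os.path.isfile(os.path.join(out_dir, "train", "input_1.csv"))

print(extracted)
```

Returning the member names also covers the "print all file names" goal from the notes at the top of this step.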

    

### step 4: `src/nfl_game_competition/pipelines/train_pipeline.py`

In [None]:
from src.nfl_game_competition.entity.artifact_entity import DataIngestionArtifact 
#from src.nfl_game_competition.entity.config_entity import (TrainingPipelineConfig, DataIngestionConfig) 
#from src.nfl_game_competition.components.data_ingestion import DataIngestion


logger = get_logger(name="nfl_game_competition/pipelines/train_pipeline.py")
class TrainingPipeline:
    def __init__(self):
        self.training_pipeline_config = TrainingPipelineConfig()

    def start_data_ingestion(self) -> Path:
        try:
            self.data_ingestion_config = DataIngestionConfig(training_pipeline_config=self.training_pipeline_config)
            # the keyword must match DataIngestion.__init__, which expects data_ingestion_config
            data_ingestion = DataIngestion(data_ingestion_config=self.data_ingestion_config)
            # for now this returns the extraction directory, not a DataIngestionArtifact
            unzip_dir = data_ingestion.initiate_data_ingestion()
            #logger.info(f"Data Ingestion Artifact: {data_ingestion_artifact}")
            return unzip_dir
        except Exception as e:
            raise NFLGameCompetitionException(e, sys) from e

    # run pipeline
    def run_pipeline(self):
        try:
            data_ingestion_artifact = self.start_data_ingestion()
            # data_validation_artifact = self.start_data_validation(data_ingestion_artifact=data_ingestion_artifact)
            # data_transformation_artifact = self.start_data_transformation(data_validation_artifact=data_validation_artifact)
            # model_trainer_artifact = self.start_model_training(data_transformation_artifact=data_transformation_artifact)
            # model_evaluation_artifact = self.start_model_evaluation(data_validation_artifact=data_validation_artifact,
            #                                                        model_trainer_artifact=model_trainer_artifact)
            # model_pusher_artifact = self.start_model_pusher(model_evaluation_artifact=model_evaluation_artifact)
            # return model_pusher_artifact
            return data_ingestion_artifact
            
        except Exception as e:
            raise NFLGameCompetitionException(e, sys) from e

In [None]:
# check if the pipeline is running
if __name__ == "__main__":
    train_pipeline = TrainingPipeline()
    train_pipeline.run_pipeline()

In [19]:
# check if everything is fine
if __name__ == "__main__":
    try:
        training_pipeline_config = TrainingPipelineConfig()
        data_ingestion_config = DataIngestionConfig(training_pipeline_config=training_pipeline_config)
        #print(data_ingestion_config.__dict__)

        data_ingestion = DataIngestion(data_ingestion_config=data_ingestion_config)
        zip_file_path = data_ingestion.download_nfl_data()
        print(f"Downloaded zip file path: {zip_file_path}")
        
        extracted_path = data_ingestion.unzip_and_extract_data(zip_file_path=zip_file_path,
                                                              extract_dir=data_ingestion_config.unzipped_data_dir)
        #print(f"Extracted data directory: {extracted_path}")

        # List all files in the extracted directory
        #all_files = glob.glob(os.path.join(extracted_path, '**', '*.csv'), recursive=True)
        print("List of all files in the extracted directory:")
        #for file in all_files:
           # print(file)

    except Exception as e:
        raise NFLGameCompetitionException(e, sys) from e

artifacts\10_15_2025_16_09_13\data_ingestion\ingested\zipped_data
Downloaded zip file path: artifacts\10_15_2025_16_09_13\data_ingestion\ingested\zipped_data
zip file name:  nfl-big-data-bowl-2026-prediction
List of all files in the extracted directory:
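
When listing the extracted files, note that `recursive=True` only has an effect if the pattern contains `**`. The sketch below demonstrates this on a temporary directory. The file names (`test.csv`, `train/input_w01.csv`) are placeholders and are not taken from the actual competition archive.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "train"))
    for rel in ("test.csv", os.path.join("train", "input_w01.csv")):
        with open(os.path.join(root, rel), "w") as fh:
            fh.write("a,b\n")

    # Without '**' the pattern only matches the top level, even with recursive=True
    top_level = glob.glob(os.path.join(root, "*.csv"), recursive=True)
    # '**' combined with recursive=True matches the root and all subdirectories
    nested = glob.glob(os.path.join(root, "**", "*.csv"), recursive=True)

    top_names = sorted(os.path.relpath(p, root) for p in top_level)
    all_names = sorted(os.path.relpath(p, root) for p in nested)

print(top_names)
print(all_names)
```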


In [None]:
# check data downloading by kaggle api

import os
from kaggle.api.kaggle_api_extended import KaggleApi  # Import Kaggle API

# Step 1: Authenticate using kaggle.json file (must be in ~/.kaggle)
api = KaggleApi()
api.authenticate()  # This reads your ~/.kaggle/kaggle.json file for username & key

# Step 2: Define the competition name
competition_name = "nfl-big-data-bowl-2026-prediction"  # replace with your actual competition name

# Step 3: Define a local directory to save the data
download_dir = "data_sample"  # folder where data will be downloaded
os.makedirs(download_dir, exist_ok=True)  # create the folder if it does not exist

# Step 4: Download all competition files as ZIP
api.competition_download_files(competition_name, path=download_dir)  # downloads and saves zip file

print(f"✅ Download complete! Files are saved in: {download_dir}")


✅ Download complete! Files are saved in: data_sample
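
Since `api.authenticate()` fails when no credentials are configured, a small preflight check can give a clearer error message before the download starts. The sketch below assumes the conventional locations read by the Kaggle client: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` file in `~/.kaggle` (or in the folder named by `KAGGLE_CONFIG_DIR`). The helper `kaggle_credentials_available` is hypothetical and may not cover every authentication method a given client version supports.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def kaggle_credentials_available(
    home: Optional[Path] = None, env: Optional[Mapping[str, str]] = None
) -> bool:
    """Return True if Kaggle credentials appear to be configured."""
    home = Path.home() if home is None else home
    env = os.environ if env is None else env
    # Option 1: both environment variables are set
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Option 2: kaggle.json in KAGGLE_CONFIG_DIR or the default ~/.kaggle folder
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR", home / ".kaggle"))
    return (config_dir / "kaggle.json").is_file()
```

Calling `kaggle_credentials_available()` with no arguments checks the current user's home directory and environment. The parameters exist mainly so the function can be tested against a temporary directory.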
