<a href="https://colab.research.google.com/github/apoorvapu/data_science/blob/main/BigData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Big Data

1. Out-of-Core Processing (Processing Without Loading Everything in RAM)
These tools allow you to process datasets that don't fit in memory.

Dask is a pandas alternative for large datasets: a parallel computing library that extends the pandas and NumPy APIs to data that is larger than memory. Example: using Dask instead of pandas

import dask.dataframe as dd

df = dd.read_csv('large_dataset.csv')  # Lazy loading

print(df.head())  # head() triggers computation, reading only the first partition
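
The idea behind out-of-core processing can be sketched without any library: read a fixed number of rows, fold them into a running result, then discard them. The snippet below is a minimal illustration using only the standard library, not how Dask is implemented. The helper names `iter_chunks` and `streaming_mean`, the column name `value`, and the in-memory sample data are all assumptions made for the example.

```python
import csv
import io
from itertools import islice


def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows from a csv reader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def streaming_mean(file_obj, column, chunk_size=2):
    """Mean of one numeric column, holding a single chunk in memory at a time."""
    reader = csv.DictReader(file_obj)
    total, count = 0.0, 0
    for chunk in iter_chunks(reader, chunk_size):
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# Tiny in-memory stand-in for a large CSV file (hypothetical data).
sample_csv = io.StringIO("id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n")
mean_value = streaming_mean(sample_csv, "value")
print(mean_value)
```

Only the running total and row count survive between chunks, so memory use depends on `chunk_size` rather than on the size of the file.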

2. Distributed Computing (Scaling Beyond One Machine)
For big data processing, distributed frameworks like Apache Spark work well.

PySpark (Apache Spark in Python): Handles datasets that are terabytes in size. Works with HDFS, S3, and other cloud storage.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BigData").getOrCreate()

df = spark.read.csv("large_dataset.csv", header=True, inferSchema=True)
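
Distributed frameworks generally split the data into partitions, compute a partial result on each partition, and then merge the partial results. The following single-machine sketch shows that pattern with threads and the standard library only. It is not Spark code, and the partitions, keys, and function names (`count_partition`, `combine`) are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_partition(rows):
    """Map step: count rows per key within a single partition."""
    return Counter(key for key, _ in rows)


def combine(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Hypothetical partitions, standing in for blocks of a file spread across workers.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
    [("a", 6), ("c", 7), ("c", 8)],
]

# Each partition is processed independently, as separate workers would do.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(count_partition, partitions))

# The partial results are small, so merging them is cheap.
totals = reduce(combine, partials, Counter())
print(dict(totals))
```

Because each partition is counted in isolation, the map step can run on as many workers as there are partitions. Only the compact partial counts need to be moved and merged.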

3. Cloud & External Storage for Large Data
If your local system has limited RAM, you can keep the data in cloud storage and pull in only the rows you need:

Google Colab + Google Drive/BigQuery (for structured data)

Hugging Face Datasets (for large ML datasets)

from google.colab import auth

from google.cloud import bigquery

auth.authenticate_user()

client = bigquery.Client(project="your-project-id")  # placeholder: replace with your own Google Cloud project ID

query = "SELECT * FROM `bigquery-public-data.ml_datasets.penguins` LIMIT 1000"  # FROM needs a table, not only a dataset; penguins is one example table

df = client.query(query).to_dataframe()

4. AI Modeling on Large Datasets:

Incremental Learning (Training Models in Chunks)

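Scikit-learn’s partial_fit() (for estimators that support it, such as SGDClassifier, which trains logistic regression or a linear SVM depending on its loss)

TensorFlow/Keras fit() with a generator or tf.data.Dataset (for large image/text datasets; the older fit_generator() is deprecated in TensorFlow 2.x)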

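import pandas as pd

from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()  # Linear model trained with stochastic gradient descent

# Read the file 10,000 rows at a time so only one chunk is in memory
for chunk in pd.read_csv("large_dataset.csv", chunksize=10000):
    # classes lists every label up front, since a single chunk may not contain them all
    clf.partial_fit(chunk.drop("target", axis=1), chunk["target"], classes=[0, 1])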
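
To show what an incremental update does between chunks, here is a minimal sketch of an online logistic regression written with the standard library only. It is a hypothetical toy model, not scikit-learn's implementation. The class name `OnlineLogisticRegression`, the learning rate, and the synthetic one-feature data are assumptions chosen for illustration.

```python
import math
import random


class OnlineLogisticRegression:
    """Hypothetical minimal model; not scikit-learn's implementation."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        """One SGD pass over a single chunk; weights carry over between calls."""
        for x, target in zip(X, y):
            error = self.predict_proba(x) - target
            self.weights = [
                w - self.learning_rate * error * xi
                for w, xi in zip(self.weights, x)
            ]
            self.bias -= self.learning_rate * error
        return self

    def predict(self, x):
        return 1 if self.predict_proba(x) >= 0.5 else 0


def make_chunk(rng, size):
    """Synthetic chunk: one feature in [-1, 1], label 1 when it is positive."""
    X = [[rng.uniform(-1, 1)] for _ in range(size)]
    y = [1 if row[0] > 0 else 0 for row in X]
    return X, y


rng = random.Random(0)
model = OnlineLogisticRegression(n_features=1)
for _ in range(20):  # 20 chunks of 50 rows, seen one at a time
    X_chunk, y_chunk = make_chunk(rng, 50)
    model.partial_fit(X_chunk, y_chunk)

print(model.predict([0.9]), model.predict([-0.9]))
```

The model never sees more than one chunk at once. What persists between calls is only the weight vector and the bias, which is why this style of training works on data that does not fit in memory.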


5. Deep Learning on Large Data: Use TensorFlow/Keras with Data Generators:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator().flow_from_directory('large_dataset', batch_size=32)

model.fit(datagen, epochs=10)  # model must be a compiled Keras model defined beforehand
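
A data generator is, at its core, a function that hands over one batch at a time instead of the whole dataset. The sketch below shows that mechanism with a plain Python generator. It does not use Keras, and the function name `batch_generator` and the file names are hypothetical stand-ins for images on disk.

```python
from typing import Iterator, List, Sequence


def batch_generator(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive batches; each batch is built only when requested."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical file names standing in for images on disk.
file_names = [f"img_{i:03d}.png" for i in range(10)]

# list() collects every batch here only so the demo can inspect them;
# a training loop would consume one batch per step instead.
batches = list(batch_generator(file_names, batch_size=4))
print([len(batch) for batch in batches])
```

A real image loader would read and decode the files for each batch inside the loop, so that decoded pixel data for only one batch is held in memory at any time.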