# Developing an LLM application using LangChain

## Task 1: Import Libraries

In [1]:
import google.generativeai as genai  # Google Generative AI SDK (Gemini models)
from IPython.display import display  # render rich output in Jupyter notebooks
from IPython.display import Markdown  # render text as Markdown in Jupyter notebooks
import textwrap  # wrap and indent text output
from langchain_community.document_loaders import PyPDFLoader  # load and parse content from PDF documents
from langchain_text_splitters.character import RecursiveCharacterTextSplitter  # split documents into smaller chunks for efficient processing
from langchain_google_genai import GoogleGenerativeAIEmbeddings  # Google’s embedding models for LangChain
from langchain_google_genai import ChatGoogleGenerativeAI  # Google’s generative AI chat models for LangChain
from langchain_community.vectorstores import Chroma  # vector store used for similarity search
from langchain.chains import RetrievalQA  # chain for building question-answering systems with RAG


genai.configure(api_key='')
model = genai.GenerativeModel('gemini-pro')
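
A key pasted into `genai.configure(api_key='')` is easy to leak when the notebook is shared. The following is a minimal sketch of one alternative, assuming the key is stored in an environment variable. The variable name `GOOGLE_API_KEY` and the helper `load_api_key` are illustrative choices, not part of the SDK.

```python
import os


def load_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Return the API key stored in env_var, or raise if it is missing or empty."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook.")
    return key


# Possible use in the cell above (load_api_key is the hypothetical helper defined here):
# genai.configure(api_key=load_api_key())
```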


## Task 2: Ask The Questions Using Prompts

In [2]:
response = model.generate_content("Explain Generative AI in 3 Bullet Points")
print(response.text)
Markdown(response.text)



* **Generates new content or data from scratch:** Generative AI uses algorithms to create novel and unique text, images, audio, or code, without relying on existing data.
* **Learns from large datasets:** It is trained on vast amounts of data to understand patterns and relationships, enabling it to generate content that is realistic, coherent, and knowledgeable.
* **Offers creative and innovative solutions:** Generative AI empowers humans to explore new ideas, generate fresh perspectives, and enhance creative output by automating the creation of novel content.



## Task 3: Chat With Gemini And Retrieve The Chat History

In [3]:
# Start a chat session; the conversation history is stored on the session object
hist = model.start_chat()

# Send a message and get the response
response = hist.send_message("Hi! Proivide a recipe to make a margeritta pizza from scratch")

# Markdown rendering of the reply
# (a cell only shows its last expression automatically; wrap this in display(...) to force rendering)
Markdown(response.text)

# Print every item in the history
# Each item contains parts and a role; the role is either "user" or "model"
for item in hist.history:
    print(item)
    print("\n\n")

# Text of the first part of the last history item (the model's reply)
item.parts[0].text

# Count the number of tokens in a prompt
model.count_tokens("Now provide the location of the nearest supermarket where I can buy the ingredients from.")


parts {
  text: "Hi! Proivide a recipe to make a margeritta pizza from scratch"
}
role: "user"




parts {
  text: "**Ingredients:**\n\n**For the Dough:**\n\n* 2 1/2 cups (300g) all-purpose flour, plus more for dusting\n* 1 teaspoon (5g) active dry yeast\n* 1 teaspoon (5g) sugar\n* 1 teaspoon (5g) kosher salt\n* 1 cup (240ml) warm water (105-115°F / 40-46°C)\n\n**For the Pizza:**\n\n* 1/2 cup (120ml) tomato sauce or crushed tomatoes\n* 1 ball (225g) fresh mozzarella cheese, sliced thinly\n* 1/2 cup (15g) freshly grated Parmesan cheese\n* 1/2 cup (20g) fresh basil leaves, torn into small pieces\n* Olive oil, for greasing\n\n**Instructions:**\n\n**To Make the Dough:**\n\n1. In a large bowl, combine the flour, yeast, sugar, and salt.\n2. Add the warm water and stir until the dough forms a sticky ball.\n3. Turn the dough out onto a lightly floured surface and knead for 5-7 minutes until it becomes smooth and elastic.\n4. Form the dough into a ball, place it in a lightly oiled bowl, and cov

total_tokens: 16
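
The printed history alternates `user` and `model` entries, each holding a list of parts. The sketch below mimics that bookkeeping in plain Python to show why the history grows by two entries per turn. `Message`, `ToyChat`, and `echo_model` are hypothetical stand-ins, not the `google.generativeai` classes, and no API call is made.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Message:
    role: str          # "user" or "model", mirroring the roles printed above
    parts: List[str]   # text parts of the message


@dataclass
class ToyChat:
    reply_fn: Callable[[List[Message]], str]   # stand-in for the real model
    history: List[Message] = field(default_factory=list)

    def send_message(self, text: str) -> str:
        # Record the user turn, get a reply based on the full history, then record the reply.
        self.history.append(Message("user", [text]))
        reply = self.reply_fn(self.history)
        self.history.append(Message("model", [reply]))
        return reply


def echo_model(history: List[Message]) -> str:
    # Toy reply: report how many messages are in the history so far.
    return f"I have seen {len(history)} message(s)."


chat = ToyChat(reply_fn=echo_model)
chat.send_message("Hi!")
chat.send_message("And now?")
for item in chat.history:
    print(item.role, "->", item.parts[0])
```

Because the whole history is passed to the reply function on every turn, later answers can depend on earlier messages, which is the behaviour a chat session adds over a single `generate_content` call.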

## Task 4: Experiment With The Temperature Parameter

In [4]:
def get_response(prompt, config):
    response = model.generate_content(contents=prompt, generation_config=config)
    return response

for temp in [0.0, 0.25, 0.5, 0.75, 1.0]:
    config = genai.types.GenerationConfig(temperature=temp)
    result = get_response("Explain the concepts of XGBoost and Random Forest with real-life use cases", config)

    print(f"\n\n For temperature {temp}, the results are: \n\n")
    display(Markdown(result.text))



 For temperature 0.0, the results are: 




**XGBoost (Extreme Gradient Boosting)**

**Concept:**
XGBoost is a powerful machine learning algorithm that combines the principles of gradient boosting and decision trees. It builds an ensemble of decision trees, where each tree is trained on a weighted version of the training data. The weights are adjusted based on the errors made by the previous trees, ensuring that subsequent trees focus on correcting the mistakes of their predecessors.

**Real-Life Use Cases:**
* **Fraud Detection:** XGBoost can identify fraudulent transactions by analyzing patterns in financial data.
* **Customer Churn Prediction:** It can predict the likelihood of customers leaving a service by considering factors such as usage history and demographics.
* **Recommendation Systems:** XGBoost can recommend products or services to users based on their past preferences and interactions.

**Random Forest**

**Concept:**
Random Forest is an ensemble learning algorithm that combines multiple decision trees. Each tree is trained on a different subset of the training data and a random subset of features. The final prediction is made by combining the predictions of all the individual trees.

**Real-Life Use Cases:**
* **Image Classification:** Random Forest can classify images into different categories, such as animals, objects, or scenes.
* **Natural Language Processing:** It can be used for tasks such as text classification, sentiment analysis, and spam detection.
* **Medical Diagnosis:** Random Forest can assist in diagnosing diseases by analyzing patient data, such as symptoms, medical history, and test results.

**Comparison:**

* **Accuracy:** Both XGBoost and Random Forest are highly accurate algorithms. However, XGBoost tends to perform slightly better on complex datasets.
* **Speed:** XGBoost is generally faster than Random Forest, especially for large datasets.
* **Interpretability:** Random Forest is more interpretable than XGBoost, as it is easier to understand the individual decision trees that make up the ensemble.
* **Hyperparameter Tuning:** XGBoost requires more hyperparameter tuning than Random Forest, which can be time-consuming.

**Conclusion:**

XGBoost and Random Forest are both powerful machine learning algorithms with a wide range of applications. XGBoost is particularly well-suited for tasks that require high accuracy and speed, while Random Forest is more appropriate when interpretability is important. By understanding the strengths and weaknesses of each algorithm, data scientists can select the best tool for their specific problem.



 For temperature 0.25, the results are: 




**XGBoost**

**Concept:** XGBoost (Extreme Gradient Boosting) is a machine learning algorithm that combines multiple decision trees into a single, highly accurate model. It uses a gradient boosting approach, where each tree is trained on the residuals of the previous tree, resulting in a cumulative improvement in predictive performance.

**Real-Life Use Case:**

* **Predicting Customer Churn:** XGBoost can be used to predict which customers are likely to churn (stop using a service). By identifying these customers, businesses can implement targeted retention strategies to minimize churn rates.

**Random Forest**

**Concept:** Random Forest is an ensemble learning algorithm that combines multiple decision trees. Each tree is trained on a different subset of the data and makes predictions independently. The final prediction is determined by combining the predictions of all the individual trees.

**Real-Life Use Case:**

* **Image Classification:** Random Forest can be used to classify images into different categories. By training the algorithm on a large dataset of labeled images, it can learn to recognize patterns and make accurate predictions.

**Comparison of XGBoost and Random Forest**

| Feature | XGBoost | Random Forest |
|---|---|---|
| Model Type | Gradient Boosting | Ensemble of Decision Trees |
| Regularization | Yes | Yes |
| Feature Importance | Yes | Yes |
| Computational Complexity | Higher | Lower |
| Accuracy | Typically higher | Typically lower |
| Interpretability | Lower | Higher |

**When to Use XGBoost vs. Random Forest**

* Use XGBoost when you need high accuracy and are willing to trade off interpretability for performance.
* Use Random Forest when you need a more interpretable model and are willing to sacrifice some accuracy.



 For temperature 0.5, the results are: 




**XGBoost (Extreme Gradient Boosting)**

XGBoost is a scalable and efficient gradient boosting algorithm that combines the power of multiple decision trees to create a robust and accurate prediction model.

**Key Concepts:**

* **Gradient Boosting:** XGBoost uses a series of decision trees to iteratively improve the prediction accuracy. Each tree focuses on correcting the errors made by previous trees.
* **Regularization:** XGBoost employs regularization techniques to prevent overfitting, such as L1 and L2 regularization.
* **Parallelization:** XGBoost can be parallelized for faster training on large datasets.

**Real-Life Use Cases:**

* **Fraud Detection:** Identifying fraudulent transactions by analyzing historical data and patterns.
* **Credit Risk Assessment:** Predicting the likelihood of loan default based on borrower characteristics and financial history.
* **Stock Market Prediction:** Forecasting stock prices by considering market conditions, company fundamentals, and historical trends.

**Random Forest**

Random Forest is an ensemble learning algorithm that combines multiple decision trees to create a more accurate and robust model.

**Key Concepts:**

* **Bagging:** Random Forest generates multiple decision trees using different subsets of the data and features.
* **Random Feature Selection:** Each tree is constructed using a random subset of features, reducing the impact of correlated variables.
* **Majority Voting:** The final prediction is determined by the majority vote of the individual decision trees.

**Real-Life Use Cases:**

* **Image Classification:** Classifying images into different categories, such as animals, objects, or scenes.
* **Natural Language Processing:** Identifying sentiment, extracting keyphrases, and performing text classification.
* **Medical Diagnosis:** Predicting diseases or patient outcomes based on medical data, such as symptoms, test results, and medical history.

**Comparison**

Both XGBoost and Random Forest are powerful machine learning algorithms, but they have different strengths and weaknesses:

* **Accuracy:** XGBoost generally has higher accuracy than Random Forest, especially for complex and large datasets.
* **Interpretability:** Random Forest is more interpretable than XGBoost, as it is easier to understand the contributions of individual decision trees.
* **Computational Cost:** XGBoost can be more computationally expensive than Random Forest, especially for large datasets.



 For temperature 0.75, the results are: 




**XGBoost (Extreme Gradient Boosting)**

XGBoost is a gradient boosting algorithm that combines multiple decision trees to make predictions. It is known for its speed, accuracy, and ability to handle large datasets.

**Use Case:** Predicting customer churn. A telecom company can use XGBoost to predict which customers are likely to cancel their subscriptions. By identifying these customers, the company can take proactive measures to retain them.

**Random Forest**

Random Forest is an ensemble learning algorithm that combines multiple decision trees. Each tree is trained on a different subset of the data and makes predictions independently. The final prediction is determined by combining the predictions of all the trees.

**Use Case:** Detecting credit card fraud. A financial institution can use Random Forest to detect fraudulent transactions. The algorithm can learn from historical data to identify patterns that are indicative of fraud, such as unusual spending behavior or suspicious IP addresses.

**Key Differences**

* **Regularization:** XGBoost uses a more advanced regularization technique called "tree pruning" to prevent overfitting. Random Forest relies on bagging and feature randomization to reduce overfitting.
* **Speed:** XGBoost is generally faster than Random Forest, especially for large datasets.
* **Hyperparameter Tuning:** XGBoost has more hyperparameters to tune, which can make it more difficult to optimize.
* **Interpretability:** Random Forest is generally more interpretable than XGBoost, as it is easier to understand the individual decision trees and their contributions to the final prediction.

**Advantages of XGBoost and Random Forest**

* **High Accuracy:** Both algorithms can achieve high accuracy on a wide range of tasks.
* **Robustness:** They are relatively insensitive to noise and outliers in the data.
* **Scalability:** They can handle large datasets efficiently.
* **Feature Importance:** They provide insights into the importance of different features in making predictions.

**Disadvantages of XGBoost and Random Forest**

* **Overfitting:** Both algorithms can overfit the data if not properly tuned.
* **Interpretability:** XGBoost can be less interpretable than Random Forest, especially for complex models.
* **Computation Time:** Hyperparameter tuning can be time-consuming, especially for XGBoost.



 For temperature 1.0, the results are: 




**XGBoost**

XGBoost (Extreme Gradient Boosting) is a machine learning algorithm designed for extremely fast execution and high accuracy. It belongs to the family of gradient boosting algorithms that combine multiple weak learners (e.g., decision trees) into a stronger, more accurate model.

**Concepts:**

* **Tree Ensembles:** XGBoost uses a sequence of decision trees arranged hierarchically, boosting each tree's predictions based on the errors of previous trees.
* **Regularization:** XGBoost includes regularization terms to prevent overfitting and improve model generalization.
* **Parallelization:** XGBoost is parallelizable, allowing for efficient computation even on large datasets.

**Use Cases:**

* **Fraud Detection:** Detecting fraudulent transactions by identifying deviations from normal spending patterns.
* **Customer Churn Prediction:** Predicting customers who are likely to cancel their subscriptions based on historical data and usage patterns.
* **Image Classification:** Categorizing images into predefined classes, such as animals, vehicles, or objects.

**Random Forest**

Random Forest is an ensemble machine learning algorithm that combines multiple decision trees to improve stability and accuracy. It operates by selecting a random subset of features and data points for each tree, reducing correlation between trees and minimizing overfitting.

**Concepts:**

* **Decision Trees:** Random Forest builds a collection of decision trees using a random subsample of the training data.
* **Voting:** Each tree independently makes predictions, and the final prediction is typically determined by a majority vote.
* **Bagging:** Random Forest uses bagging (bootstrap aggregation), where multiple samples are randomly drawn from the training data to construct individual trees.

**Use Cases:**

* **Financial Forecasting:** Predicting financial performance metrics, such as stock prices or company revenues.
* **Medical Diagnosis:** Assisting medical professionals in diagnosing diseases by analyzing patient data and symptoms.
* **Object Detection:** Identifying and localizing objects in images or videos.

**Comparison**

XGBoost and Random Forest offer similar capabilities but differ in certain aspects:

* **Speed:** XGBoost is generally faster than Random Forest, especially on large datasets.
* **Regularization:** XGBoost has built-in regularization features, while Random Forest requires manual hyperparameter tuning to prevent overfitting.
* **Parallelization:** XGBoost can be parallelized across multiple cores, allowing for faster training.
* **Feature Selection:** Random Forest supports feature selection, while XGBoost typically requires an external selection process.
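
To see what the temperature value does to the next-token distribution, the sketch below applies a common formulation: divide the logits by the temperature before the softmax. The logits are made up for illustration, and treating temperature 0 as greedy decoding is an assumption about typical behaviour, not a statement about Gemini's internals.

```python
import math
from typing import Dict


def softmax_with_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 behaves like greedy decoding (all mass on the top token).
        best = max(logits, key=logits.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in logits}
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the maximum for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Made-up logits for three candidate next tokens.
toy_logits = {"trees": 2.0, "forest": 1.0, "pizza": 0.0}
for temp in [0.0, 0.25, 0.5, 0.75, 1.0]:
    probs = softmax_with_temperature(toy_logits, temp)
    print(temp, {tok: round(p, 3) for tok, p in probs.items()})
```

Lower temperatures concentrate probability on the top token, so repeated runs tend to agree; higher temperatures flatten the distribution and allow more varied wording.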

## Task 5: Experiment With Maximum Output Tokens

In [7]:
def get_response(prompt, generation_config=None):
    response = model.generate_content(contents=prompt, generation_config=generation_config)
    return response
for max_out_tok in [1, 50, 100, 150, 200]:
    config = genai.types.GenerationConfig(max_output_tokens=max_out_tok)
    result = get_response("Explain the concepts of XGBoost and Random Forest with real-life use cases", generation_config=config)

    print(f"\n\nFor max output token value {max_out_tok}, the results are: \n\n")
    display(Markdown(result.text))




For max output token value 1, the results are: 




**



For max output token value 50, the results are: 




**XGBoost (Extreme Gradient Boosting)**

XGBoost is an advanced ensemble machine learning algorithm that combines multiple decision trees. It uses a technique called gradient boosting, which sequentially adds decision trees to a model to improve its accuracy.

**



For max output token value 100, the results are: 




**XGBoost**

XGBoost is an ensemble machine learning algorithm that uses gradient boosting to iteratively build a series of decision trees. Each subsequent tree is trained to predict the residual errors of the previous tree. XGBoost is known for its speed, accuracy, and ability to handle large datasets.

**Real-life use cases:**

* **Credit risk assessment:** Predicting the likelihood of a borrower defaulting on a loan.
* **Fraud detection:** Identifying fraudulent transactions in financial



For max output token value 150, the results are: 




**Concept of XGBoost and Random Forest**

**XGBoost (eXtreme Gradient Boosting):**
* An ensemble learning algorithm that combines multiple decision trees to create a powerful predictive model.
* Uses gradient boosting, where each tree learns from the errors of previous trees.
* Highly efficient and effective for large datasets.

**Random Forest:**
* An ensemble learning algorithm that combines multiple decision trees.
* Each tree in the forest is trained on a different subset of the training data and a different subset of features.
* The final prediction is made by majority vote or averaging the predictions of individual trees.

**Real-Life Use Cases**

**XGBoost:**

* **Retail Recommendation:** Predicting product recommendations



For max output token value 200, the results are: 




**XGBoost (Extreme Gradient Boosting)**

**Concept:** XGBoost is an ensemble machine learning algorithm that combines multiple weak learners (decision trees) into a powerful, predictive model. It uses a gradient boosting framework to optimize the objective function iteratively, minimizing the error at each step.

**Real-Life Use Cases:**

* **Fraud Detection:** Identifying fraudulent transactions in financial data by analyzing customer behavior.
* **Healthcare Risk Prediction:** Predicting the risk of developing certain diseases based on medical history and lifestyle factors.
* **Natural Language Processing:** Enhancing text classification and sentiment analysis by capturing non-linear relationships in text.

**Random Forest**

**Concept:** Random Forest is an ensemble learning method that constructs multiple randomized decision trees. Each tree is trained on a subset of the training data, and their predictions are combined to make a final decision.

**Real-Life Use Cases:**

* **Image Classification:** Classifying images into different categories, such as people,
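
The outputs above stop mid-sentence, or even mid-markup (`**`), because generation halts as soon as the token budget is spent. A minimal sketch of that cut-off, assuming whitespace-separated words as tokens (real tokenizers use subword units, so actual counts differ); `generate_with_budget` is a hypothetical helper:

```python
from typing import Iterable, List


def generate_with_budget(token_stream: Iterable[str], max_output_tokens: int) -> List[str]:
    """Collect tokens from a stream until the output budget is exhausted."""
    output: List[str] = []
    for token in token_stream:
        if len(output) >= max_output_tokens:
            break  # budget used up: stop even if the sentence is unfinished
        output.append(token)
    return output


# A made-up ten-token answer standing in for the model's full response.
full_answer = "**XGBoost** is an ensemble algorithm that uses gradient boosting .".split()
for budget in [1, 5, 50]:
    print(budget, " ".join(generate_with_budget(full_answer, budget)))
```

A budget larger than the full answer changes nothing; the limit only truncates, it does not make the model write a shorter, complete answer.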

## Task 6: Experiment With the top_k Parameter

In [8]:
def get_response(prompt, generation_config=None):
    response = model.generate_content(contents=prompt,
                                      generation_config=generation_config)
    return response

for k in [1, 4, 16, 32, 40]:
    config = genai.types.GenerationConfig(top_k=k)
    result = get_response("Explain the concepts of XGBoost and Random Forest with real-life use cases", generation_config=config)

    print(f"\n\nFor top k value {k}, the results are: \n\n")
    display(Markdown(result.text))




For top k value 1, the results are: 




**XGBoost** (Extreme Gradient Boosting) and **Random Forest** are two popular ensemble machine learning algorithms used for supervised learning tasks, particularly for classification and regression problems.

**XGBoost**

* **Concept:** XGBoost is a gradient boosting algorithm that builds a series of decision trees. It uses a regularized objective function to prevent overfitting and improves the accuracy by optimizing each tree sequentially.
* **Key Features:**
    * Regularization: XGBoost uses L1 and L2 regularization to penalize complex models and reduce overfitting.
    * Tree Pruning: It uses a technique called "pruning" to remove less important branches from the decision trees, improving model efficiency.
    * Parallel Processing: XGBoost can be parallelized to train models on large datasets efficiently.
* **Use Cases:**
    * Credit Scoring: Predicting the probability of default on a loan.
    * Customer Churn: Identifying customers at risk of leaving a service.
    * Stock Price Prediction: Forecasting stock prices based on historical data.

**Random Forest**

* **Concept:** Random Forest is an ensemble algorithm that combines multiple decision trees. It constructs each tree using a different subset of data and features, reducing the variance and improving the overall accuracy.
* **Key Features:**
    * Feature Randomization: Random Forest randomly selects a subset of features for each tree, increasing diversity among the trees.
    * Bootstrap Sampling: Each tree is trained on a bootstrapped sample of the original dataset, further reducing overfitting.
    * Majority Voting: For classification tasks, the output of Random Forest is determined by the majority vote of the individual trees.
* **Use Cases:**
    * Spam Detection: Filtering out spam emails from a dataset.
    * Image Recognition: Classifying images into different categories.
    * Natural Language Processing: Identifying topics or sentiments in text data.

**Comparison:**

* **Accuracy:** Both algorithms are known for their high accuracy, with XGBoost being slightly more accurate in general.
* **Interpretability:** Random Forest is more interpretable than XGBoost, as it is easier to visualize and understand the decision-making process of individual trees.
* **Hyperparameter Tuning:** XGBoost has more hyperparameters to tune, which can be both an advantage and a disadvantage.
* **Computational Complexity:** XGBoost is computationally more expensive than Random Forest, especially for larger datasets.

**Conclusion:**

XGBoost and Random Forest are powerful ensemble algorithms that offer high accuracy and robustness for supervised learning tasks. The choice between the two depends on factors such as the size of the dataset, the level of interpretability required, and the computational resources available.



For top k value 4, the results are: 




**XGBoost (Extreme Gradient Boosting)**

**Concept:** XGBoost is an ensemble machine learning algorithm based on gradient boosting. It trains multiple decision trees sequentially, where each subsequent tree focuses on correcting the errors of the previous ones. XGBoost optimizes the trees' structure and weights to minimize the overall loss function.

**Real-Life Use Cases:**

* Predicting customer churn: Identifying customers at risk of leaving a service or subscription.
* Detecting fraudulent transactions: Flagging suspicious credit card or financial transactions.
* Personalizing recommendations: Tailoring product or content recommendations to individual users.

**Random Forest**

**Concept:** Random forest is an ensemble machine learning algorithm that combines multiple decision trees. Each tree is trained on a different subset of the data and a random selection of features. The final prediction is made by combining the predictions of all the trees.

**Real-Life Use Cases:**

* Classifying medical images: Identifying different types of diseases or tissues in medical scans.
* Predicting stock prices: Forecasting the future value of stocks based on historical data and market trends.
* Spam detection: Distinguishing between legitimate and spam emails.

**Comparison between XGBoost and Random Forest**

| Feature | XGBoost | Random Forest |
|---|---|---|
| Regularization | Supports regularization techniques like L1 and L2 | Supports regularization techniques like L1 and L2 |
| Hyperparameter Tuning | Requires careful hyperparameter tuning | Requires less hyperparameter tuning |
| Computational Cost | More computationally intensive | Less computationally intensive |
| Performance | Generally higher accuracy and predictive power | Generally lower accuracy and predictive power |
| Interpretability | Less interpretable compared to Random Forest | More interpretable compared to XGBoost |

**Choice of Algorithm:**

* **XGBoost** is preferred for high-dimensional data, complex problems, and when interpretability is not a major concern.
* **Random Forest** is preferred for problems where interpretability is important, such as medical diagnosis or patient risk assessment.



For top k value 16, the results are: 




## XGBoost

XGBoost (Extreme Gradient Boosting) is an ensemble machine learning algorithm that combines multiple decision trees to create a more powerful and accurate model. It is a popular technique for solving a wide range of classification and regression problems.

**Key Concepts of XGBoost:**

* **Boosting:** A technique where multiple weak learners (decision trees) are combined to create a strong learner.
* **Gradient Boosting:** A specific type of boosting that uses the gradient of the loss function to update the subsequent trees.
* **Regularization:** A technique that prevents overfitting by penalizing complex models.

**Real-Life Use Cases:**

* **Predicting Credit Risk:** XGBoost can be used to predict the likelihood of a borrower defaulting on a loan.
* **Fraud Detection:** It can help identify fraudulent transactions by analyzing patterns in historical data.
* **Sales Forecasting:** XGBoost can be used to predict future sales based on historical data and various factors.

## Random Forest

Random Forest is an ensemble learning algorithm that combines multiple decision trees. Each tree is trained on a different subset of the data, and the final prediction is made by combining the predictions of all the trees.

**Key Concepts of Random Forest:**

* **Bagging:** A technique where multiple decision trees are trained on different subsets of the data to reduce variance.
* **Feature Subsetting:** Random Forest randomly selects a subset of features to use in each tree, improving model generalization.

**Real-Life Use Cases:**

* **Image Classification:** Random Forest can be used to classify images into different categories, such as animals, objects, and scenes.
* **Spam Filtering:** It can help identify and filter spam emails based on patterns in the content and sender information.
* **Medical Diagnosis:** Random Forest can assist in diagnosing medical conditions by analyzing patient data and symptoms.

## Comparison of XGBoost and Random Forest

| Feature       | XGBoost       | Random Forest         |
|---|---|---|
| Tree Structure    | Boosting            | Bagging                 |
| Feature Selection | Regularization         | Feature Subsetting      |
| Computational Cost | Usually higher         | Usually lower          |
| Interpretability  | Lower          | Higher                 |
| Accuracy     | Generally higher         | Generally lower          |



For top k value 32, the results are: 




**XGBoost**

**Concept:**
XGBoost (eXtreme Gradient Boosting) is a machine learning algorithm that combines multiple decision trees to create a more accurate and robust model. It uses a technique called gradient boosting, where each tree is built to correct the errors of the previous trees by predicting the residual errors from the previous model.

**Real-Life Use Case:**
* **Predicting Credit Risk:** XGBoost can help banks assess the risk of potential borrowers by analyzing their financial data, credit history, and spending patterns. The model can predict the probability of loan default, enabling banks to make informed decisions about loan approvals and interest rates.

**Random Forest**

**Concept:**
Random Forest is an ensemble learning algorithm that combines multiple decision trees. Each tree is trained on a subset of the training data, and the final prediction is made by combining the outputs of all the trees. The randomness in the algorithm helps prevent overfitting and improves generalization.

**Real-Life Use Case:**
* **Image Classification:** Random Forest can be used to classify images into different categories, such as animals, vehicles, or landscapes. The algorithm analyzes the features of the image, such as color, texture, and shape, to determine its probable category.

**Comparison**

| Feature | XGBoost | Random Forest |
|---|---|---|
| Model Type | Gradient Boosting | Ensemble of Decision Trees |
| Regularization | Built-in | Randomness from Subsampling |
| Hyperparameter Tuning | More complex | Easier |
| Computational Cost | Higher | Lower |
| Performance | Generally higher accuracy | Similar or lower accuracy |

**Conclusion**

Both XGBoost and Random Forest are powerful machine learning algorithms with wide applications. XGBoost is known for its high accuracy and robustness, while Random Forest is simpler to implement and tune. The choice between the two algorithms depends on the specific task and the desired level of accuracy and computational efficiency.



For top k value 40, the results are: 




**XGBoost (Extreme Gradient Boosting)**

XGBoost is a supervised machine learning algorithm that combines multiple decision trees to make predictions. It is an ensemble method, meaning it combines the outputs of several weak learners to create a stronger learner.

**Concepts:**

* **Gradient Boosting:** XGBoost builds decision trees sequentially, using the gradient of the loss function to determine the optimal split at each tree node.
* **Regularization:** XGBoost employs regularization techniques to prevent overfitting, such as L1 and L2 regularization.
* **Tree Pruning:** XGBoost uses a technique called "tree pruning" to reduce the size and complexity of the trees, improving performance and interpretability.

**Real-Life Use Case:** Predicting Customer Churn

* A telecom company wants to predict which customers are likely to churn (cancel their service).
* XGBoost can be used to build a model that combines information about customers' usage history, demographics, and account details.
* The model can be used to identify high-risk customers and implement targeted retention strategies.

**Random Forest**

Random Forest is an ensemble method that combines multiple decision trees into a single model. It operates by randomly selecting a subset of features and data points to train each individual tree.

**Concepts:**

* **Bootstrapping:** Random Forest uses bootstrapping, where each tree is trained on a different subset of the data, increasing the diversity of the ensemble.
* **Feature Bagging:** Random Forest performs feature bagging, where each tree is trained on a different subset of features, reducing the impact of correlated features.
* **Aggregation:** The final prediction is made by combining the predictions of the individual trees, usually through majority voting or averaging.

**Real-Life Use Case:** Image Classification

* An online photo-sharing platform wants to automatically classify images into different categories.
* Random Forest can be used to build a model that combines information about pixel values, textures, and shapes.
* The model can be used to classify images into various categories, such as animals, landscapes, or portraits.

**Comparison:**

* Both XGBoost and Random Forest are ensemble methods that combine multiple decision trees.
* XGBoost is typically more accurate and efficient than Random Forest, especially for larger datasets.
* Random Forest is easier to interpret and implement due to its simpler algorithm.
* The choice between XGBoost and Random Forest depends on the specific problem and the available resources.

## Task 7: Experiment With the top_p Parameter

In [9]:
def get_response(prompt, generation_config=None):
    # Send the prompt to the model; a config of None falls back to the model defaults
    response = model.generate_content(contents=prompt,
                                      generation_config=generation_config)
    return response

for p in [0, 0.2, 0.4, 0.8, 1]:
    config = genai.types.GenerationConfig(top_p=p)
    result = get_response("Explain the concepts of XGBoost and Random Forest with real-life use cases", generation_config=config)

    # Label each output with the loop variable p
    print(f"\n\nFor top p value {p}, the results are: \n\n")
    display(Markdown(result.text))




For top p value 0, the results are: 




**XGBoost (Extreme Gradient Boosting)**

**Concept:**
XGBoost is a powerful machine learning algorithm that combines the principles of gradient boosting and decision trees. It builds an ensemble of decision trees, where each tree is trained on a weighted version of the training data. The weights are adjusted based on the errors made by the previous trees, ensuring that subsequent trees focus on correcting the mistakes of their predecessors.

**Real-Life Use Cases:**
* **Fraud Detection:** XGBoost can identify fraudulent transactions by analyzing patterns in financial data.
* **Customer Churn Prediction:** It can predict the likelihood of customers leaving a service by considering factors such as usage history and demographics.
* **Recommendation Systems:** XGBoost can recommend products or services to users based on their past preferences and interactions.

**Random Forest**

**Concept:**
Random Forest is an ensemble learning algorithm that combines multiple decision trees. Each tree is trained on a different subset of the training data and a random subset of features. The final prediction is made by combining the predictions of all the individual trees.

**Real-Life Use Cases:**
* **Image Classification:** Random Forest can classify images into different categories, such as animals, objects, or scenes.
* **Natural Language Processing:** It can be used for tasks such as text classification, sentiment analysis, and spam detection.
* **Medical Diagnosis:** Random Forest can assist in diagnosing diseases by analyzing patient data, such as symptoms, medical history, and test results.

**Comparison:**

* **Accuracy:** Both XGBoost and Random Forest are highly accurate algorithms. However, XGBoost tends to perform slightly better on complex datasets.
* **Speed:** XGBoost is generally faster than Random Forest, especially for large datasets.
* **Interpretability:** Random Forest is more interpretable than XGBoost, as it is easier to understand the individual decision trees that make up the ensemble.
* **Hyperparameter Tuning:** XGBoost requires more hyperparameter tuning than Random Forest, which can be time-consuming.

**Conclusion:**

XGBoost and Random Forest are both powerful machine learning algorithms with a wide range of applications. XGBoost is particularly well-suited for tasks that require high accuracy and speed, while Random Forest is more appropriate when interpretability is important. By understanding the strengths and weaknesses of each algorithm, data scientists can select the best tool for their specific problem.



For top p value 0.2, the results are: 




**XGBoost (Extreme Gradient Boosting)**

**Concept:**
XGBoost is a powerful machine learning algorithm that combines the principles of gradient boosting and decision trees. It builds an ensemble of decision trees, where each tree is trained on a weighted version of the training data. The weights are adjusted based on the errors made by the previous trees, ensuring that subsequent trees focus on correcting the mistakes of their predecessors.

**Real-Life Use Cases:**
* **Fraud Detection:** XGBoost can identify fraudulent transactions by analyzing patterns in financial data.
* **Customer Churn Prediction:** It can predict the likelihood of customers leaving a service by considering factors such as usage history and demographics.
* **Recommendation Systems:** XGBoost can recommend products or services to users based on their past preferences and interactions.

**Random Forest**

**Concept:**
Random Forest is an ensemble learning algorithm that combines multiple decision trees. Each tree is trained on a different subset of the training data and a random subset of features. The final prediction is made by combining the predictions of all the individual trees.

**Real-Life Use Cases:**
* **Image Classification:** Random Forest can classify images into different categories, such as animals, objects, or scenes.
* **Natural Language Processing:** It can be used for tasks such as text classification, sentiment analysis, and spam detection.
* **Medical Diagnosis:** Random Forest can assist in diagnosing diseases by analyzing patient data, such as symptoms, medical history, and test results.

**Comparison:**

* **Accuracy:** Both XGBoost and Random Forest are highly accurate algorithms. However, XGBoost tends to perform slightly better on complex datasets.
* **Speed:** XGBoost is generally faster than Random Forest, especially for large datasets.
* **Interpretability:** Random Forest is more interpretable than XGBoost, as it is easier to understand the individual decision trees that make up the ensemble.
* **Hyperparameter Tuning:** XGBoost requires more hyperparameter tuning than Random Forest, which can be time-consuming.

**Conclusion:**

XGBoost and Random Forest are both powerful machine learning algorithms with a wide range of applications. XGBoost is particularly well-suited for tasks that require high accuracy and speed, while Random Forest is more appropriate when interpretability is important. The choice between the two algorithms depends on the specific requirements of the problem at hand.



For top p value 0.4, the results are: 




**XGBoost (Extreme Gradient Boosting)**

XGBoost is a machine learning algorithm that combines the power of gradient boosting with decision trees. It is known for its high accuracy and efficiency, making it a popular choice for a wide range of applications.

**Key Concepts:**

* **Gradient Boosting:** XGBoost uses gradient boosting, which involves iteratively building decision trees to minimize the loss function.
* **Regularization:** XGBoost includes regularization terms to prevent overfitting and improve generalization.
* **Tree Pruning:** XGBoost employs tree pruning techniques to reduce the complexity of the model and enhance its interpretability.

**Real-Life Use Cases:**

* **Fraud Detection:** XGBoost can identify fraudulent transactions by analyzing patterns in financial data.
* **Customer Churn Prediction:** It can predict the likelihood of customers leaving a service or product based on their behavior and demographics.
* **Medical Diagnosis:** XGBoost assists in diagnosing diseases by analyzing medical images and patient records.

**Random Forest**

Random Forest is an ensemble learning algorithm that combines multiple decision trees to make predictions. It is known for its robustness and ability to handle large datasets.

**Key Concepts:**

* **Ensemble Learning:** Random Forest creates a collection of decision trees, each trained on a different subset of the data.
* **Bagging:** It uses bagging, where each tree is trained on a random sample of the data with replacement.
* **Feature Randomization:** Random Forest randomly selects a subset of features for each tree, reducing the risk of overfitting.

**Real-Life Use Cases:**

* **Image Classification:** Random Forest can classify images into different categories, such as animals, objects, or scenes.
* **Natural Language Processing:** It can analyze text data for tasks like sentiment analysis and spam detection.
* **Financial Forecasting:** Random Forest can predict stock prices or economic indicators based on historical data.

**Comparison:**

| Feature | XGBoost | Random Forest |
|---|---|---|
| Accuracy | Higher | Comparable |
| Efficiency | Lower | Higher |
| Regularization | Yes | No |
| Tree Pruning | Yes | No |
| Interpretability | Lower | Higher |
| Ensemble Method | Gradient Boosting | Bagging |

**Conclusion:**

XGBoost and Random Forest are powerful machine learning algorithms with distinct strengths and weaknesses. XGBoost excels in tasks requiring high accuracy and efficiency, while Random Forest is more suitable for applications where interpretability and robustness are important. By understanding the concepts and use cases of these algorithms, practitioners can select the most appropriate tool for their specific problem.



For top p value 0.8, the results are: 




**XGBoost**

XGBoost (eXtreme Gradient Boosting) is a powerful ensemble machine learning algorithm that combines the strengths of gradient boosting and regularization. It is widely used in a variety of machine learning tasks, including classification, regression, and ranking.

**Key Concepts:**

* **Gradient Boosting:** XGBoost builds an ensemble of decision trees, where each subsequent tree is trained to correct the errors of the previous trees.
* **Regularization:** XGBoost employs various regularization techniques, such as L1 and L2 regularization, to prevent overfitting and improve generalization performance.
* **Scalability:** XGBoost is highly scalable and can handle large datasets with millions of rows and thousands of features.

**Real-Life Use Cases:**

* **Predicting Customer Churn:** XGBoost can be used to identify customers at risk of churning and take proactive measures to retain them.
* **Fraud Detection:** XGBoost is effective in detecting fraudulent transactions in financial datasets.
* **Credit Scoring:** XGBoost models can help banks and other lenders assess the creditworthiness of borrowers.

**Random Forest**

Random Forest is an ensemble learning algorithm that consists of a multitude of decision trees. It operates on the principle of bootstrap aggregation, where multiple decision trees are trained on different subsets of the data.

**Key Concepts:**

* **Bootstrap Aggregation:** Random Forest trains multiple decision trees on random subsets of the data, with replacement.
* **Majority Voting:** Predictions are made by combining the outputs of all the individual decision trees, typically through majority voting or averaging.
* **Feature Randomness:** Random Forest introduces randomness in feature selection, where a subset of features is randomly selected for each decision tree.

**Real-Life Use Cases:**

* **Image Classification:** Random Forest is widely used in image classification tasks, such as object detection and scene recognition.
* **Text Classification:** Random Forest can be used to classify text documents into different categories, such as spam filtering or sentiment analysis.
* **Medical Diagnosis:** Random Forest models can assist medical professionals in diagnosing diseases and predicting patient outcomes.

**Comparison**

Both XGBoost and Random Forest are powerful ensemble learning algorithms, but they differ in several aspects:

* **Complexity:** XGBoost is generally more complex and requires more hyperparameter tuning compared to Random Forest.
* **Accuracy:** XGBoost often achieves higher accuracy than Random Forest, especially on complex datasets.
* **Interpretability:** Random Forest is generally more interpretable than XGBoost, as it is easier to understand the decision-making process of individual decision trees.
* **Scalability:** XGBoost is more scalable than Random Forest and can handle larger datasets more efficiently.



For top p value 1, the results are: 




**XGBoost (Extreme Gradient Boosting)**

**Concept:**
XGBoost is a powerful machine learning algorithm that combines multiple weak models (decision trees) into a single strong model. It utilizes gradient boosting, where each new tree iteratively corrects the errors of the previous trees by focusing on data points that were misclassified.

**Real-Life Use Case:**
**Fraud Detection:** XGBoost is effective in identifying fraudulent transactions in financial datasets. It analyzes historical transaction patterns to detect anomalies and suspicious activities.

**Random Forest**

**Concept:**
Random Forest is an ensemble learning method that creates multiple decision trees during training. Each tree is trained on a different subset of the data and uses a different feature subset for splitting at each node. The final prediction is made by combining the predictions from all individual trees.

**Real-Life Use Case:**
**Object Recognition:** Random Forest is commonly used in computer vision for object recognition. It trains multiple trees on different patches or regions of an image to improve accuracy and robustness.

**Comparison:**

* **Accuracy:** Both XGBoost and Random Forest can achieve high accuracy, but XGBoost is generally considered more accurate, especially for complex and structured datasets.
* **Interpretability:** Random Forest is more interpretable than XGBoost as the individual decision trees can be analyzed to understand the model's decision-making process.
* **Computational Complexity:** XGBoost is more computationally intensive and requires longer training times than Random Forest.
* **Tuning:** XGBoost has more hyperparameters to tune compared to Random Forest, which can be challenging but can lead to better performance.

**Additional Use Cases:**

**XGBoost:**
* Sentiment Analysis
* Click-Through Rate Prediction
* Credit Risk Assessment

**Random Forest:**
* Image Classification
* Medical Diagnosis
* Time Series Analysis

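To make the effect of `top_p` more concrete, here is a minimal sketch of nucleus (top-p) filtering over a toy next-token distribution. It illustrates the general idea only and is not Gemini's actual decoder; the function name `top_p_filter` and the toy probabilities are hypothetical.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches top_p, then renormalize the kept probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break  # the nucleus is complete
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Hypothetical next-token probabilities, for illustration only
toy_probs = {"trees": 0.45, "forests": 0.3, "boosting": 0.15, "pruning": 0.1}

for p in [0, 0.2, 0.4, 0.8, 1]:
    nucleus = top_p_filter(toy_probs, p)
    print(p, sorted(nucleus))
```

With low values only the single most likely token survives, so sampling becomes nearly deterministic; as `top_p` approaches 1 the whole vocabulary stays eligible and outputs can vary more from run to run.
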
## Task 8: Experiment With the candidate_count Parameter

In [10]:
config = genai.types.GenerationConfig(candidate_count=1)
result = get_response("Explain the concepts of XGBoost and Random Forest with real-life use cases", generation_config=config)
Markdown(result.text)


**XGBoost (Extreme Gradient Boosting)**

**Concept:**
XGBoost is a powerful machine learning algorithm that combines the principles of gradient boosting and decision trees. It sequentially builds decision trees, where each tree focuses on correcting the errors made by previous trees.

**Real-Life Use Cases:**
* Fraud Detection: Identifying fraudulent transactions by learning from historical data.
* Click-Through Rate Prediction: Predicting the probability of a user clicking on an ad based on website features.
* Credit Risk Assessment: Evaluating the risk of default for loan applicants.

**Random Forest**

**Concept:**
Random Forest is an ensemble learning algorithm that constructs multiple decision trees during training. Each tree randomly selects a subset of features and samples, creating a diverse set of trees. Predictions are made by combining the predictions of all individual trees.

**Real-Life Use Cases:**
* Image Classification: Classifying images into different categories, such as animals, objects, or scenes.
* Natural Language Processing: Understanding and extracting meaning from text, including sentiment analysis and topic modeling.
* Medical Diagnosis: Assisting doctors in diagnosing diseases by analyzing patient data.

**Comparison:**

| Feature | XGBoost | Random Forest |
|---|---|---|
| Algorithm Type | Gradient Boosting | Ensemble Learning |
| Base Learner | Decision Trees | Decision Trees |
| Feature Selection | Weighted by importance | Random selection |
| Model Complexity | Higher | Lower |
| Training Time | Longer | Shorter |
| Regularization | Regularized by tree depth | Regularized by number of trees |
| Performance | Usually higher | Can vary depending on the dataset |

**Choice of Algorithm:**

* For highly complex tasks and when interpretability is less important, **XGBoost** is often preferred due to its superior performance.
* For tasks where interpretability and computational efficiency are crucial, **Random Forest** may be a better choice.

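`candidate_count` sets how many alternative responses are requested per prompt, and the cell above reads only `result.text`, which covers the single-candidate case. A minimal sketch of reading every candidate follows. It assumes the response exposes `candidates`, each with `content.parts` holding `text`, as `google.generativeai` responses do; the helper name `candidate_texts` and the stand-in response built with `SimpleNamespace` are hypothetical, so the snippet runs without an API key.

```python
from types import SimpleNamespace

def candidate_texts(response):
    """Return one string per candidate by joining the text of its parts."""
    return [
        "".join(part.text for part in candidate.content.parts)
        for candidate in response.candidates
    ]

# Stand-in for an API response with two candidates (hypothetical data)
fake_response = SimpleNamespace(candidates=[
    SimpleNamespace(content=SimpleNamespace(parts=[
        SimpleNamespace(text="Answer A"),
    ])),
    SimpleNamespace(content=SimpleNamespace(parts=[
        SimpleNamespace(text="Answer "),
        SimpleNamespace(text="B"),
    ])),
])

for i, text in enumerate(candidate_texts(fake_response), start=1):
    print(f"Candidate {i}: {text}")
```

With the real `result` from the cell above, `candidate_texts(result)` would return a single-element list, since `candidate_count=1`. Whether values above 1 are accepted depends on the model version.
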
## Task 9: Introduction to Retrieval Augmented Generation

## Task 10: Load the PDF and Extract the Texts

In [11]:
CHUNK_SIZE = 700
CHUNK_OVERLAP = 100
pdf_path = "https://www.analytixlabs.co.in/assets/pdfs/Data_Engineering%20&_Other_Job_Roles-AnalytixLabs.pdf"


In [12]:
# Load the PDF (PyPDFLoader also accepts a URL) and split it into Document objects
pdf_loader = PyPDFLoader(pdf_path)
split_pdf_document = pdf_loader.load_and_split()


In [13]:
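# Join the extracted text, then re-split it into chunks of at most CHUNK_SIZE
# characters, with up to CHUNK_OVERLAP characters shared between neighboring chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
context = "\n\n".join(str(p.page_content) for p in split_pdf_document)
texts = text_splitter.split_text(context)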

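To see how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a simplified sketch that slices text into fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line breaks); the function `sliding_chunks` is a hypothetical illustration of the size/overlap arithmetic only.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters, where each window
    starts chunk_size - chunk_overlap characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "x" * 2000  # placeholder text, 2000 characters long
chunks = sliding_chunks(sample, chunk_size=700, chunk_overlap=100)
print([len(c) for c in chunks])
```

Each chunk repeats the last 100 characters of its predecessor, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.
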

## Task 11: Create the Gemini Model and Create the Embeddings

In [16]:
gemini_model = ChatGoogleGenerativeAI(model='gemini-pro', google_api_key="", temperature=0.8)

In [18]:
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key="")

In [20]:
vector_index = Chroma.from_texts(texts, embeddings)
retriever = vector_index.as_retriever(search_kwargs={"k" : 5})

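Conceptually, the retriever embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below illustrates that idea with cosine similarity over toy 3-dimensional vectors. It is not Chroma's implementation, which uses its own index and distance settings; `top_k`, the sample texts, and the vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k):
    """Return the k (text, score) pairs with the highest cosine similarity."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical chunk texts with made-up embedding vectors
store = [
    ("Data engineers build pipelines", [0.9, 0.1, 0.0]),
    ("Data scientists train models", [0.1, 0.9, 0.0]),
    ("Analysts build dashboards", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.2, 0.0]

for text, score in top_k(query, store, k=2):
    print(f"{score:.3f}  {text}")
```

In the notebook, `search_kwargs={"k": 5}` plays the role of `k` here: five chunks are handed to the model for each question.
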
## Task 12: Create the RAG Chain and Ask Query

In [21]:
qa_chain = RetrievalQA.from_chain_type(gemini_model, retriever=retriever, return_source_documents=True)

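By default, `RetrievalQA.from_chain_type` uses the "stuff" strategy: the retrieved chunks are concatenated into a single prompt together with the question. The sketch below shows that assembly step in plain Python. The template wording, the helper `build_stuff_prompt`, and the sample chunks are hypothetical; they are not LangChain's actual prompt or text from the PDF.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate retrieved chunks into one context block and append the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Placeholder chunks standing in for what the retriever would return
sample_chunks = [
    "Data engineers design and maintain data pipelines.",
    "They commonly work with SQL and distributed processing frameworks.",
]
prompt = build_stuff_prompt(
    "Which tools do Data Engineers primarily work with?", sample_chunks
)
print(prompt)
```

Because the chain is created with `return_source_documents=True`, the chunks actually passed to the model are available under `result["source_documents"]` once `qa_chain.invoke(...)` has run.
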
In [22]:
# Example usage 
question = "Which tools do Data Engineers primarily work with?"
result = qa_chain.invoke({"query": question})
print("Answer:", result["result"])


Answer: * **Data integration tools:** These tools help data engineers connect to and extract data from various data sources, such as databases, spreadsheets, and cloud storage. Examples include Informatica PowerCenter, Talend Data Integration, and Microsoft SQL Server Integration Services (SSIS).
* **Data transformation tools:** These tools help data engineers clean, transform, and prepare data for analysis and modeling. Examples include Apache Spark, Apache Hadoop, and Flink.
* **Data warehousing tools:** These tools help data engineers create and manage data warehouses, which are central repositories of data for analysis and reporting. Examples include Amazon Redshift, Google BigQuery, and Microsoft Azure Synapse Analytics.
* **Data visualization tools:** These tools help data engineers create visual representations of data, such as charts, graphs, and dashboards. Examples include Tableau, Power BI, and Google Data Studio.
* **Data modeling tools:** These tools help data engineers cr