<a href="https://colab.research.google.com/github/ektedar/LangChainTest/blob/main/LangChainTest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LangChain

The objective is to investigate how LangChain interacts and works with GPT-4 and how we can use it to template prompts and/or analyze data.

1. We need to find a dataset that we can play with
2. We need to get the API token from OpenAI
3. We need to install LangChain and understand its inner workings

## Install LangChain

In [2]:
!pip install langchain openai

Collecting langchain
  Downloading langchain-0.0.220-py3-none-any.whl (1.2 MB)
Collecting openai
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
Collecting dataclasses-json<0.6.0,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.5.9-py3-none-any.whl (26 kB)
Collecting langchainplus-sdk>=0.0.17 (from langchain)
  Downloading langchainplus_sdk-0.0.18-py3-none-any.w

# Sample Example

Running a sample example from the LangChain documentation

In [3]:
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

## Get the OpenAI API Key

Set the OpenAI API key for use in the cells below.

In [32]:
import os
import openai
openai.organization = os.getenv("ORG_ID")
openai.api_key = os.getenv("OPENAI_API_KEY")
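`os.getenv` returns `None` without complaint when a variable is unset, so a missing key only surfaces later as an authentication error. A minimal sketch of failing early instead; the helper name `require_env` is hypothetical and not part of `openai` or LangChain:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this in place, the cell above could assign `openai.api_key = require_env("OPENAI_API_KEY")` and stop immediately if the key is missing.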

# Loading LLMs

Setting the prompt template and the input variables. From this, it seems you have to set the template and designate an input variable that the prompt will swap out. The prompt is then passed along to OpenAI. `OPENAI_API_KEY` only authenticates the request; the model itself is whichever one the `OpenAI` wrapper is configured with (its default, unless another is specified).

In [27]:
llm = OpenAI(temperature=0.9)  # temperature controls sampling randomness: higher values give more varied output
prompt = PromptTemplate(
    input_variables=["city"],
    template="What is a good NFL name for a {city} that is already not in the NFL?"
)
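To make the `temperature` setting less mysterious, here is a small sketch of the usual textbook formulation: the raw token scores are divided by the temperature before being turned into probabilities. The scores below are made up, and how OpenAI's models apply temperature internally is an assumption here, not something this notebook verifies.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, dividing by temperature first."""
    # This sketch requires temperature > 0.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.2)  # low temperature: top choice dominates
flat = softmax_with_temperature(logits, 0.9)   # higher temperature: choices are closer
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

A lower temperature concentrates probability on the top-scoring token, which is why `temperature=0` is used later for the dataframe agent, where repeatable answers matter more than variety.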

## Running the Chain

Now we take a user input, substitute it into the prompt's input variable, and send the formatted prompt to the LLM for its output.

In [28]:
from langchain.chains import LLMChain
chain = LLMChain(
    llm=llm,
    prompt=prompt
)

In [34]:
chain.run('Norwich').strip()

'Norwich Fury'
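Conceptually, the chain above does two things: it fills the template and then calls the model. The sketch below mimics that flow in plain Python with a stand-in model; `run_chain` and `fake_llm` are hypothetical names, and this is not how `LLMChain` is implemented internally.

```python
from typing import Callable


def run_chain(template: str, llm: Callable[[str], str], **variables: str) -> str:
    """Fill the template, pass the prompt to the model, return the stripped reply."""
    prompt = template.format(**variables)
    return llm(prompt).strip()


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; it just echoes the prompt.
    return f"  (model reply to: {prompt})  "


template = "What is a good NFL name for a {city} that is already not in the NFL?"
result = run_chain(template, fake_llm, city="Norwich")
print(result)
```

Swapping `fake_llm` for a real model call is the only change needed to turn the sketch into something that produces answers like the one above.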

# Data Analysis with LangChain

Let's attempt to analyze some data with LangChain and see how it performs. Let's grab a dataset from online, ideally from `Kaggle`, and see if we can run some analysis on it using LangChain.

In [35]:
import pandas as pd

df = pd.read_csv('Basic_Stats.csv')

In [37]:
df.shape

(17172, 16)

In [38]:
df.head()

Unnamed: 0,Age,Birth Place,Birthday,College,Current Status,Current Team,Experience,Height (inches),High School,High School Location,Name,Number,Player Id,Position,Weight (lbs),Years Played
0,,"Grand Rapids , MI",5/23/1921,Notre Dame,Retired,,3 Seasons,71.0,,,"Evans, Fred",,fredevans/2513736,,185.0,1946 - 1948
1,,"Dayton , OH",12/21/1930,Dayton,Retired,,1 Season,70.0,,,"Raiff, Jim",,jimraiff/2523700,,235.0,1954 - 1954
2,56.0,"Temple , TX",9/11/1960,Louisiana Tech,Retired,,1 Season,74.0,,,"Fowler, Bobby",,bobbyfowler/2514295,,230.0,1985 - 1985
3,30.0,"New Orleans , LA",9/30/1986,LSU,Retired,,5 Seasons,73.0,,,"Johnson, Quinn",,quinnjohnson/79593,,255.0,2009 - 2013
4,25.0,"Detroit , MI",3/31/1992,Central Michigan,Active,Pittsburgh Steelers,3rd season,77.0,Clintondale HS,"Clinton Twp.,Macomb Co., MI","Walton, L.T.",96.0,l.t.walton/2552444,DE,305.0,


### NFL Data

Here we are playing with some NFL data. We are going to see how much analysis and data transformation we can do without actually writing anything in `Pandas`. The goal is to understand how LangChain works for data analysis and what its capabilities are.

In [40]:
from langchain.agents import create_pandas_dataframe_agent
from langchain.chat_models import ChatOpenAI
from langchain.agents.agent_types import AgentType
from langchain.llms import OpenAI

We will be creating an agent that interacts with the dataframe for us by generating and running Python code under the hood. Because it executes generated Python code, the LangChain authors advise using the agent cautiously. The best bet is to test the agent on a server rather than on a personal machine.

So when initializing the agent, we are doing the following:
1. We create an `OpenAI` model with a specified temperature. This is where the model is initialized, so it is also the place to swap in a different model.
2. We pass the loaded `df` to the agent. Open questions: what type does the `df` need to be, and are there any limitations when assigning the `df` to the `agent`?
3. `verbose=True` prints the agent's intermediate thoughts, actions, and observations, as the logs below show. It will probably go along with something like `debug=True` if we want to use the agent programmatically.

In [41]:
agent = create_pandas_dataframe_agent(
    OpenAI(temperature=0),
    df,
    verbose=True
)

In [44]:
agent.run("How many players have retired that are around the age of 34")

> Entering new chain...
Thought: I need to find the players that have retired and are around the age of 34
Action: python_repl_ast
Action Input: df[(df['Current Status'] == 'Retired') & (df['Age'] >= 33) & (df['Age'] <= 35)]
Observation:
        Age     Birth Place    Birthday               College Current Status  \
12     33.0             NaN   12/4/1983           Boise State        Retired   
29     34.0             NaN    3/3/1983          Georgia Tech        Retired   
47     33.0  Bradenton , FL    6/9/1983              Nebraska        Retired   
67     34.0             NaN   5/29/1983      Eastern Michigan        Retired   
77     33.0   San Jose , CA    2/6/1984      Washington State        Retired   
...     ...             ...         ...                   ...            ...   
17101  33.0   Portland , OR   9/16/1983           Boise State        Retired   
17129  35.0             NaN    5/4/1982  North Carolina State        Retired   
1

'895 players have retired that are around the age of 34.'
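The agent interpreted "around the age of 34" as ages 33 to 35 inclusive. The same filter is easy to run by hand to double-check a reported count. The sketch below uses a toy dataframe with made-up rows rather than `Basic_Stats.csv`, so the number it prints says nothing about the real data.

```python
import pandas as pd

# Toy stand-in for Basic_Stats.csv; the rows are made up for illustration.
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Age": [33.0, 34.0, 36.0, None, 35.0],
    "Current Status": ["Retired", "Retired", "Retired", "Retired", "Active"],
})

# Same condition the agent generated: retired players aged 33 to 35 inclusive.
# Rows with a missing Age never satisfy between(), so they are excluded.
mask = (toy_df["Current Status"] == "Retired") & toy_df["Age"].between(33, 35)
retired_around_34 = int(mask.sum())
print(retired_around_34)
```

Running the equivalent two lines against `df` would confirm or contradict the 895 figure without relying on the agent's summary.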

Let's give the model something more demanding, something that requires more than a small analysis of the data.

In [46]:
game_logs_df = pd.read_csv('Game_Logs_Quarterback.csv')

In [64]:
game_logs_df

Unnamed: 0,Player Id,Name,Position,Year,Season,Week,Game Date,Home or Away,Opponent,Outcome,...,Ints,Sacks,Sacked Yards Lost,Passer Rating,Rushing Attempts,Rushing Yards,Yards Per Carry,Rushing TDs,Fumbles,Fumbles Lost
0,jaredzabransky/2495791,"Zabransky, Jared",,2007,Preseason,1,08/11,Home,CHI,L,...,--,--,--,0.0,--,--,--,--,--,--
1,jaredzabransky/2495791,"Zabransky, Jared",,2007,Preseason,2,08/18,Away,ARI,W,...,0,0,0,46.9,--,--,--,--,--,--
2,jaredzabransky/2495791,"Zabransky, Jared",,2007,Preseason,3,08/25,Home,DAL,W,...,--,--,--,0.0,--,--,--,--,--,--
3,jaredzabransky/2495791,"Zabransky, Jared",,2007,Preseason,4,08/30,Away,TB,L,...,1,2,13,50.8,2,17,8.5,0,--,--
4,billdemory/2512778,"Demory, Bill",,1974,Regular Season,1,09/15,Away,KC,L,...,--,--,--,0.0,--,--,--,--,--,--
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40242,steveramsey/2523725,"Ramsey, Steve",,1971,Regular Season,12,12/05,Home,CHI,W,...,2,3,27,25.3,2,0,0.0,0,--,--
40243,steveramsey/2523725,"Ramsey, Steve",,1971,Regular Season,13,12/12,Away,SD,L,...,3,1,6,37.3,--,--,--,--,--,--
40244,steveramsey/2523725,"Ramsey, Steve",,1971,Regular Season,14,12/19,Away,OAK,L,...,1,2,13,57.5,1,6,6.0,0,--,--
40245,steveramsey/2523725,"Ramsey, Steve",,1970,Regular Season,13,12/13,Home,SF,L,...,--,--,--,0.0,--,--,--,--,--,--


In [51]:
agent = create_pandas_dataframe_agent(
    OpenAI(temperature=0),
    game_logs_df,
    verbose=True
)

In [58]:
len(game_logs_df['Name'].unique())

517

In [59]:
agent.run("Filter the dataset and only show me Michael Vick's data.  Store this into a new dataframe called mv_df")

> Entering new chain...
Thought: I need to filter the dataframe to only show Michael Vick's data
Action: python_repl_ast
Action Input: mv_df = df[df['Name'] == 'Vick, Michael']
Observation: 
Thought: I now have a new dataframe with only Michael Vick's data
Final Answer: mv_df is a new dataframe with only Michael Vick's data.

> Finished chain.


"mv_df is a new dataframe with only Michael Vick's data."

In [60]:
mv_df

NameError: ignored

It looks like the agent can only run analyses within the scope of the dataset it was given and is not allowed to do anything beyond that. For example, in the cell above I asked it to create a new dataframe called `mv_df`. The logs suggest that it did, but the `NameError` shows that `mv_df` never reached the notebook's namespace; the assignment presumably happened only inside the agent's own Python session.
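One plausible explanation, assuming the agent's Python tool keeps its own namespace (which is what the `NameError` suggests), is sketched below with plain `exec`. The dataframe is toy data and `tool_namespace` is a hypothetical name; this is an illustration of the scoping effect, not LangChain's actual code.

```python
import pandas as pd

# Toy stand-in for game_logs_df; rows are made up for illustration.
logs = pd.DataFrame({
    "Name": ["Vick, Mike", "Ramsey, Steve", "Vick, Mike"],
    "Outcome": ["W", "L", "L"],
})

# A private namespace, similar in spirit to a tool that runs generated code.
tool_namespace = {"df": logs}
exec("mv_df = df[df['Name'] == 'Vick, Mike']", {}, tool_namespace)

leaked = "mv_df" in globals()        # did the variable reach our own scope?
inside = "mv_df" in tool_namespace   # it lives in the tool's namespace instead

# Doing the same filter directly keeps the result in the notebook.
mv_df_direct = logs[logs["Name"] == "Vick, Mike"]
print(leaked, inside, len(mv_df_direct))
```

If a filtered dataframe is needed in later cells, writing the one-line pandas filter directly is the reliable route.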

## Cross Pollinating Datasets

Let's try to create an agent that can run analysis on both datasets and then observe what kinds of operations we can perform with it.

In [61]:
agent = create_pandas_dataframe_agent(
    OpenAI(temperature=0),
    [df, game_logs_df],
    verbose=True
)

In [62]:
agent.run("Is Michael Vick in the df dataset")

> Entering new chain...
Thought: I need to find out if Michael Vick is in the dataset
Action: python_repl_ast
Action Input: df1[df1['Name'] == 'Vick, Michael']
Observation: Empty DataFrame
Columns: [Age, Birth Place, Birthday, College, Current Status, Current Team, Experience, Height (inches), High School, High School Location, Name, Number, Player Id, Position, Weight (lbs), Years Played]
Index: []
Thought: I need to check the second dataset
Action: python_repl_ast
Action Input: df2[df2['Name'] == 'Vick, Michael']
Observation: Empty DataFrame
Columns: [Player Id, Name, Position, Year, Season, Week, Game Date, Home or Away, Opponent, Outcome, Score, Games Played, Games Started, Passes Completed, Passes Attempted, Completion Percentage, Passing Yards, Passing Yards Per Attempt, TD Passes, Ints, Sacks, Sacked Yards Lost, Passer Rating, Rushing Attempts, Rushing Yards, Yards Per Carry, Rushing TDs, Fumbles, Fumbles

'No, Michael Vick is not in the dataset.'

In [63]:
agent.run("Cross reference to see if there is a player on both the dataframes")

> Entering new chain...
Thought: I need to compare the Player Ids
Action: python_repl_ast
Action Input: df1['Player Id'].isin(df2['Player Id'])
Observation:
0        False
1        False
2        False
3        False
4        False
         ...  
17167    False
17168    False
17169    False
17170    False
17171    False
Name: Player Id, Length: 17172, dtype: bool
Thought: I now know the final answer
Final Answer: No, there is no player on both dataframes.

> Finished chain.


'No, there is no player on both dataframes.'
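The agent reached this answer from the truncated preview of a boolean Series, which only shows the first and last few rows. Later outputs in this notebook show the id `mikevick/2504531` in both dataframes, so the "No" is not reliable. A hand check could reduce the Series to a single value first, as in this sketch; the frames are toy data, with ids copied from rows printed elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-ins for the two dataframes; not the real CSV contents.
players = pd.DataFrame({"Player Id": ["fredevans/2513736", "mikevick/2504531"]})
games = pd.DataFrame({
    "Player Id": ["mikevick/2504531", "mikevick/2504531", "steveramsey/2523725"],
})

# isin() gives one boolean per row; any() collapses it to a single answer.
in_both = players["Player Id"].isin(games["Player Id"])
any_shared = bool(in_both.any())

# The set intersection lists exactly which ids appear in both frames.
shared_ids = sorted(set(players["Player Id"]) & set(games["Player Id"]))
print(any_shared, shared_ids)
```

Asking the agent for `.any()` or `.sum()` explicitly, rather than for the raw Series, would likely avoid this kind of misreading.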

In [68]:
agent.run("how many wins did mike vick have in his career?")

> Entering new chain...
Thought: I need to find out how many games Mike Vick played and how many of those he won
Action: python_repl_ast
Action Input: df1[df1['Name'] == 'Vick, Mike']['Outcome'].value_counts()
Observation: KeyError: 'Outcome'
Thought: I need to look at the other dataframe
Action: python_repl_ast
Action Input: df2[df2['Name'] == 'Vick, Mike']['Outcome'].value_counts()
Observation:
L    127
W    126
T      1
Name: Outcome, dtype: int64
Thought: I now know the final answer
Final Answer: Mike Vick had 126 wins in his career.

> Finished chain.


'Mike Vick had 126 wins in his career.'

In [72]:
agent.run("count the number of wins that mike vick has")

> Entering new chain...
Thought: I need to find out which player is Mike Vick and then count the number of wins
Action: python_repl_ast
Action Input: df[df['Name'] == 'Vick, Mike']
Observation: NameError: name 'df' is not defined
Thought: I need to define the dataframe
Action: python_repl_ast
Action Input: df1[df1['Name'] == 'Vick, Mike']
Observation:
        Age        Birth Place   Birthday        College Current Status  \
16031  36.0  Newport News , VA  6/26/1980  Virginia Tech        Retired   

      Current Team  Experience  Height (inches) High School  \
16031          NaN  15 Seasons             72.0         NaN   

      High School Location        Name  Number         Player Id Position  \
16031                  NaN  Vick, Mike     NaN  mikevick/2504531      NaN   

       Weight (lbs) Years Played  
16031         210.0  2001 - 2015  
Thought: I need to filter the dataframe to only inc

'Mike Vick has 126 wins.'

In [78]:
agent.run("how many times out of those 126 games, did he lose the fumble")

> Entering new chain...
Thought: I need to find out how many times he fumbled and how many times he lost the fumble
Action: python_repl_ast
Action Input: df1[df1['Fumbles'] > 0]['Fumbles Lost'].sum()
Observation: KeyError: 'Fumbles'
Thought: I need to look at the other dataframe
Action: python_repl_ast
Action Input: df2[df2['Fumbles'] > 0]['Fumbles Lost'].sum()
Observation: TypeError: '>' not supported between instances of 'str' and 'int'
Thought: I need to convert the Fumbles column to int
Action: python_repl_ast
Action Input: df2['Fumbles'] = df2['Fumbles'].astype(int)
Observation: 
Thought: Now I can find the sum of Fumbles Lost
Action: python_repl_ast
Action Input: df2[df2['Fumbles'] > 0]['Fumbles Lost'].sum()
Observation: 2475
Thought: I now know the final answer
Final Answer: 2475 times out of those 126 games, he lost the 

'2475 times out of those 126 games, he lost the fumble.'

In [95]:
W_df = game_logs_df[game_logs_df['Outcome'] == 'W']
W_df[W_df['Name']=="Vick, Mike"]

Unnamed: 0,Player Id,Name,Position,Year,Season,Week,Game Date,Home or Away,Opponent,Outcome,...,Ints,Sacks,Sacked Yards Lost,Passer Rating,Rushing Attempts,Rushing Yards,Yards Per Carry,Rushing TDs,Fumbles,Fumbles Lost
36275,mikevick/2504531,"Vick, Mike",,2015,Regular Season,2,09/20,Home,SF,W,...,--,--,--,0.0,3,-3,-1.0,0,0,0
36276,mikevick/2504531,"Vick, Mike",,2015,Regular Season,3,09/27,Away,STL,W,...,0,2,11,93.1,2,-2,-1.0,0,1,0
36278,mikevick/2504531,"Vick, Mike",,2015,Regular Season,5,10/12,Away,SD,W,...,1,3,9,73.1,1,24,24.0,0,0,0
36279,mikevick/2504531,"Vick, Mike",,2015,Regular Season,6,10/18,Home,ARI,W,...,0,1,5,45.8,5,47,9.4,0,0,0
36282,mikevick/2504531,"Vick, Mike",,2015,Regular Season,9,11/08,Home,OAK,W,...,--,--,--,0.0,--,--,--,--,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36515,mikevick/2504531,"Vick, Mike",,2001,Regular Season,6,10/21,Away,NO,W,...,--,--,--,0.0,--,--,--,--,0,0
36517,mikevick/2504531,"Vick, Mike",,2001,Regular Season,9,11/11,Home,DAL,W,...,0,2,5,70.1,2,4,2.0,0,1,1
36518,mikevick/2504531,"Vick, Mike",,2001,Regular Season,10,11/18,Away,GB,W,...,--,--,--,0.0,--,--,--,--,0,0
36519,mikevick/2504531,"Vick, Mike",,2001,Regular Season,11,11/25,Away,CAR,W,...,--,--,--,0.0,--,--,--,--,0,0
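The agent's fumble query earlier never filtered by player or by outcome, so its 2475 covers every quarterback in `df2`. Building on the manual filter above, a hand computation could look like the sketch below. It runs on toy rows, not the real file, and assumes `"--"` marks a missing stat, as the printed tables suggest.

```python
import pandas as pd

# Toy stand-in for game_logs_df; rows are made up, "--" marks missing stats.
toy_logs = pd.DataFrame({
    "Name": ["Vick, Mike", "Vick, Mike", "Vick, Mike", "Ramsey, Steve"],
    "Outcome": ["W", "W", "L", "W"],
    "Fumbles Lost": ["1", "--", "2", "3"],
})

# Keep only Vick's wins, mirroring the manual filter in the cell above.
vick_wins = toy_logs[(toy_logs["Name"] == "Vick, Mike") & (toy_logs["Outcome"] == "W")]

# errors="coerce" turns non-numeric markers such as "--" into NaN, which sum() skips.
fumbles_lost_in_wins = int(pd.to_numeric(vick_wins["Fumbles Lost"], errors="coerce").sum())
print(fumbles_lost_in_wins)
```

Using `pd.to_numeric(..., errors="coerce")` also avoids modifying the column in place, which the agent did with `astype(int)` during its run.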


In [96]:
agent.run("when was mike vick's last game?")

> Entering new chain...
Thought: I need to find Mike Vick's player id
Action: python_repl_ast
Action Input: df1[df1['Name'] == 'Vick, Mike']
Observation:
        Age        Birth Place   Birthday        College Current Status  \
16031  36.0  Newport News , VA  6/26/1980  Virginia Tech        Retired   

      Current Team  Experience  Height (inches) High School  \
16031          NaN  15 Seasons             72.0         NaN   

      High School Location        Name  Number         Player Id Position  \
16031                  NaN  Vick, Mike     NaN  mikevick/2504531      NaN   

       Weight (lbs) Years Played  
16031         210.0  2001 - 2015  
Thought: I need to find Mike Vick's games
Action: python_repl_ast
Action Input: df2[df2['Player Id'] == 'mikevick/2504531']
Observation:
              Player Id        Name Position  Year          Season  Week  \
36272  mikevick/2504531  Vick, Mike      NaN  2015     

"Mike Vick's last game was on December 31, 2006."
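This answer conflicts with the 2015 rows shown earlier for the same player. The game logs store the date as a `Year` column plus an `MM/DD` string in `Game Date`, so finding the latest game requires combining the two. The sketch below does that on toy rows; it assumes `Year` is the calendar year of the game, which may not hold for games played in January of the following year.

```python
import pandas as pd

# Toy stand-in for game_logs_df; rows are made up but follow the Year + "MM/DD" layout.
toy_logs = pd.DataFrame({
    "Name": ["Vick, Mike", "Vick, Mike", "Vick, Mike"],
    "Year": [2001, 2015, 2006],
    "Game Date": ["11/25", "10/18", "12/31"],
})

# Build a full date from the two columns, then take the latest one.
full_date = pd.to_datetime(
    toy_logs["Year"].astype(str) + "/" + toy_logs["Game Date"],
    format="%Y/%m/%d",
)
last_game = full_date.max()
print(last_game.date())  # prints 2015-10-18 for this toy data
```

Comparing `Game Date` strings alone would rank `12/31` above `10/18` regardless of year, which may be how the agent arrived at its date.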

## Final Thoughts

This dataset is definitely not right. The accumulated data does not add up to what is actually in the NFL. The `agent` is cool to use, but you have to be very precise about the dataset when prompting it. I also noticed that at one point it got stuck in an infinite loop trying to find a column and failing, even though the column existed.