# Hi! I'm David
*I'm an AI engineer following the resources (commodities) industries*. I'm interested in automating manual work, especially in mining, and improving the collective understanding of resource ventures so investors don't get rug-pulled in today's critical mineral supercycle.

You can reach me on [linkedin](https://linkedin.com/in/davidimprovz)  |  [twitter](https://twitter.com/d_comfe)  |  [email](mailto:drifft.hello@gmail.com)  |  [my free newsletter](https://drifft.beehiiv.com)

Here's a quick tutorial on how to use AI to reason over weekly news feeds on mining and artificial intelligence.

**Outline** 
1. Install prerequisites 
2. Perform a web search and scrape all the news.
3. Transform the search data 
4. (the fun part) Use an AI to reason over the news


## 1. Prereqs

In [None]:
!pip install -qU \
  chromadb \
  langchain==0.0.340 \
  xformers==0.0.20 \
  pandas \
  duckduckgo-search \
  openai \
  scrapy \
  scrapydo \
  html2text 

Next step: Critical! 

You can use either OpenAI or Llama 2. 

If you choose Llama 2, you have to get it running first. I find the easiest way is to:
* Download and install the Ollama client from [ollama.ai](https://ollama.ai).
* Once installed, start the application by clicking its icon.
* Open your terminal and run `ollama pull llama2:7b-chat`.

If you're running Linux (or Google Colab), you can do:
```
!curl https://ollama.ai/install.sh | sh
!nohup ollama serve &
!ollama pull llama2
```

The other way is to use [HuggingFace](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), but I find it's easier to run the client locally rather than deal with permissions and downloads. Let someone else manage those details if you're only experimenting.

In [2]:
import json
from getpass import getpass
import warnings
import datetime
import os
from time import time
from pprint import pprint
warnings.filterwarnings('ignore')


## 2. Web Search
Using DuckDuckGo Search

In [None]:

from duckduckgo_search import DDGS
import pandas as pd

query = 'intitle:mining intitle:mineral -intitle:"text mining" -"Text Mining" -"text mining" -bitcoin -crypto -"data mining" intitle:"artificial intelligence"'

with DDGS() as ddgs:
    # # general search
    # regular_search = [item for item in ddgs.text(query, 
    #                     region='us', 
    #                     safesearch='on',
    #                     timelimit='w', # for the week 
    #                     max_results=200)]
    
    # combine this with an exclusive news search 
    news_search = [item for item in ddgs.news(query, 
                        region='us', 
                        safesearch='on',
                        timelimit='w', # for the week 
                        max_results=100)]
    
    # # combine in a table 
    # regular_search = pd.DataFrame(regular_search)
    news_search = pd.DataFrame(news_search)
    # search = pd.concat([regular_search, news_search], axis=0)
    
    # keep only this month's news in case older items slip through
    # (hard-coded to November 2023; update the year and month for your run)
    news_search.date = pd.to_datetime(news_search.date)
    filter_date = news_search['date'].dt.year == 2023 
    filter_month = news_search['date'].dt.month == 11
    news_search = news_search[filter_date & filter_month]
    
print(f"{news_search.index.size} records with columns {news_search.columns}")
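
The year and month in the filter above are fixed values, so they go stale quickly. As a rough alternative, here is a minimal sketch of a rolling "last N days" filter using only the standard library. The function name `within_last_days` and the sample records are hypothetical, and it assumes each record carries an ISO-8601 `date` string.

```python
from datetime import datetime, timedelta, timezone

def within_last_days(records, days=7, now=None):
    """Keep records whose ISO-8601 'date' field falls within the last `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    kept = []
    for record in records:
        published = datetime.fromisoformat(record["date"])
        if published.tzinfo is None:
            # assumption: dates without a timezone are treated as UTC
            published = published.replace(tzinfo=timezone.utc)
        if cutoff <= published <= now:
            kept.append(record)
    return kept

# made-up sample records, with a fixed "now" so the result is reproducible
sample = [
    {"title": "fresh", "date": "2023-11-20T08:00:00+00:00"},
    {"title": "stale", "date": "2023-09-01T08:00:00+00:00"},
]
fixed_now = datetime(2023, 11, 22, tzinfo=timezone.utc)
recent = within_last_days(sample, days=7, now=fixed_now)
print([r["title"] for r in recent])
```

In the notebook you would leave out `now` so the window follows the current date.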

Now it's time to scrape the web using the links that the search gave us.

In [160]:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor
import scrapydo
from html2text import html2text

In [161]:
# this is a simple scraper to extract all text from a webpage. 
# simply run the cell. you don't need to change anything. 
# note: it doesn't handle PDFs or videos, etc.

class MiningSearchText(scrapy.Item):
    text = scrapy.Field()

class MiningSearch(scrapy.Spider):
    name = "mining_spider"

    def __init__(self, *args, **kwargs):
        super(MiningSearch, self).__init__(*args, **kwargs)
        self.start_urls = kwargs.get('urls', [])
        # browser-like default headers; override them by passing `headers=...`
        self.headers = kwargs.get('headers', {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
                "Accept-Language": "en-US,en;q=0.9",
            })

    def start_requests(self):
        if not self.start_urls:
            raise ValueError("No URLs provided to scrape.")
        for url in self.start_urls:
            # set the link
            yield scrapy.Request(url=url, callback=self.parse, headers=self.headers)

    def parse(self, response):
        item = MiningSearchText()
        item['text'] = response

        yield item

In [None]:
# Run a quick scrape. This will take about a minute or two depending on your internet speed.

scrape_urls = list(news_search.url.values)
scrapydo.setup()
data = scrapydo.run_spider(MiningSearch, **{'urls': scrape_urls}) # to do: put into vectorized call
len(data)

In [165]:
# now clean up the scraped data ... quick and dirty.

text_data = pd.Series([item['text'] for item in data])
text_data = text_data.apply(lambda x: html2text(x.text) if hasattr(x, 'text') else None)
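# regex=True is required in pandas 2.x, where str.replace treats the pattern as a literal by default
text_data = text_data.str.replace(r'\n{2,}|\s{3,}', ' ', regex=True).str.strip()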
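
To see what that cleanup pattern does, here is a small standalone sketch using Python's `re` module on a made-up string. The helper name `squash_whitespace` is hypothetical; the pattern is the same one used in the cell above.

```python
import re

# same pattern as the cleanup step: runs of 2+ newlines, or 3+ whitespace characters
WHITESPACE_RUNS = re.compile(r"\n{2,}|\s{3,}")

def squash_whitespace(text):
    """Collapse long whitespace runs into a single space and trim the ends."""
    return WHITESPACE_RUNS.sub(" ", text).strip()

raw = "Mining news\n\n\nAI sorts ore    faster\n"
cleaned = squash_whitespace(raw)
print(cleaned)
```

Single newlines and single spaces are left alone, so ordinary sentences keep their spacing.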

In [None]:
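# then combine each page's link with its text so we can reference where we got our data later
# note: scrapy does not return pages in request order and skips failed requests,
# so we read the url off each response instead of pairing by position with scrape_urls

scrape_data = pd.DataFrame({
    'urls': [item['text'].url for item in data],
    'text': text_data,
}).dropna()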
scrape_data.tail(2)

## 3. Transform Data
Now we need to convert the data to something called `vectors`. 
Vectors are number representations of the text that help us find all data relevant to any question we ask.
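
To make that idea concrete, here is a toy sketch of similarity search in plain Python. The three-number vectors and their labels are invented for illustration (real embeddings from `all-MiniLM-L6-v2` have 384 dimensions), and `cosine_similarity` is a hand-written helper, not a library call.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# made-up vectors standing in for embedded news snippets
toy_vectors = {
    "autonomous haul trucks": [0.9, 0.1, 0.0],
    "ore sorting with machine vision": [0.7, 0.6, 0.1],
    "quarterly dividend announcement": [0.0, 0.1, 0.9],
}
# made-up vector standing in for an embedded question
question_vector = [1.0, 0.2, 0.0]

# rank the snippets from most to least similar to the question
ranked = sorted(
    toy_vectors,
    key=lambda name: cosine_similarity(question_vector, toy_vectors[name]),
    reverse=True,
)
for name in ranked:
    print(f"{cosine_similarity(question_vector, toy_vectors[name]):.3f}  {name}")
```

A vector database does the same kind of ranking, only over many more vectors and with indexing tricks to keep it fast.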

In [167]:
from langchain.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter # look up others
from langchain.docstore.document import Document


### Prep Documents

In [None]:
# all our data is contained in the `text` column of the scrape_data table 
scrape_data.tail(2)

In [172]:
loader = DataFrameLoader(scrape_data, page_content_column="text")
docs = loader.load()

In [173]:
# ai models are limited in how much data they can process at one time, 
# so we want to split up the data into `documents`
# we will use langchain's helper to load in a pandas df, 
# which will preserve the links for attribution.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
splits = text_splitter.split_documents(docs)
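
For intuition about what `chunk_size` and `chunk_overlap` control, here is a simplified sketch of fixed-size chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter tries to break on paragraph and sentence boundaries first); `split_into_chunks` is a hypothetical helper that only cuts by character count.

```python
def split_into_chunks(text, chunk_size=1000, chunk_overlap=0):
    """Cut text into pieces of at most chunk_size characters.

    Each new chunk starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

article = "ore " * 600  # a made-up 2,400-character "article"
chunks = split_into_chunks(article, chunk_size=1000, chunk_overlap=0)
print(len(chunks), [len(c) for c in chunks])
```

With no overlap the chunks join back into the original text exactly; a non-zero overlap trades some duplication for context that carries across chunk borders.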

### Create a vector DB
We want to vectorize all the text and stuff it into a database designed for that data.

In [139]:
# instantiate chromadb, a popular vector DB

from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

In [140]:
# we create the vectors, which are called `embeddings`, using another model
# it can take a few seconds to download this model.

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

In [174]:
# then we just load the documents into a vector database
# this can take a few min to run depending on the amount 
# of data that you collected in your search 

db = Chroma.from_documents(splits, embedding_function)

In [176]:
# now when we ask a question..
# the db will return data that are `semantically` similar to our search

query = "What artificial intelligence models are being talked about in mining?"
db.similarity_search(
    query, 
    k=5, # how many results you want to return
)

[Document(page_content='Mining is a vast and diverse industry. Depending on the mined product, there\nare significant differences in the technologies and processes used, making it\nchallenging to provide a comprehensive overview of the entire industry. TThis\narticle will explore how the mining industry uses or attempts to use\nartificial intelligence to enhance productivity and efficiency, which\nincludes:\n\n **1\\. AI Mineral Exploration**\n\n **2\\. Autonomous Vehicles**\n\n **3\\. Predictive Maintenance and Health Management (PHM) Systems**\n\n **4\\. AI Sorting**\n\n  \n\n ** _1\\. Mineral Exploration_**\n\nMineral exploration is crucial for mining operations. A resource-rich and\nhigh-grade deposit can enable a mining company to achieve efficient\nautomation, while a low-grade deposit might not be economically viable. The\nintroduction of artificial intelligence and machine learning into mineral\nexploration has attracted significant attention in the industry.', metadata={'urls'

## 4. Explore News
We use the similarity between our question and the news text to return information that's related to our question, and then ask the AI model to do our bidding. 

In [None]:
# if you're using llama2, let's make sure we have a llama2 model to use 
# if you're using ollama like I recommend, 
# just copy the NAME field that this command puts out

!ollama list

In [209]:
# some tools we need 

from langchain.chat_models import ChatOllama, ChatOpenAI
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough

In [214]:
# if you want it super fast, you have to pay openai
# signup and get an api key from openai.com/

os.environ['OPENAI_API_KEY'] = getpass() 

openai_llm = ChatOpenAI(
    model="gpt-3.5-turbo", 
    temperature=0.0,        # controls how creative the ai can be. 1 will be loopy.
)


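# use the free, open-source llama2:7b model, which does a great job for what we need it to do
ollama_llm = ChatOllama(
    model="llama2:7b",      # paste the NAME field here
    temperature=0.0,        # controls how creative the ai can be. 1 will be loopy.
    max_new_tokens=512,     # intended cap on output length, in tokens (not words)
    repeat_penalty=1.1      # without this, output begins repeating..repeating..repeating.
)
# note: `repeat_penalty` is Ollama's own option name. `max_new_tokens` is a
# HuggingFace-style name that ChatOllama may silently ignore; Ollama's
# equivalent option is `num_predict`, if your langchain version exposes it.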

In [None]:
# if we just want to ask the db questions, 
# we combine our ai model with the vector db.

from langchain.chains import RetrievalQA

# openai_rag_chain = RetrievalQA.from_chain_type(
#     llm=openai_llm, 
#     retriever=db.as_retriever(),
# )

ollama_rag_chain = RetrievalQA.from_chain_type(
    llm=ollama_llm, 
    retriever=db.as_retriever(),
)

# answer = openai_rag_chain("which companies are mentioned in the context of artificial intelligence?")
answer = ollama_rag_chain("which companies are mentioned in the context of artificial intelligence?")
pprint(answer['result'])  # the chain returns a dict with `query` and `result` keys


In [None]:
# if we want to ask more advanced questions and control 
# how the AI will respond, we can construct the 
# question with specific instructions the ai should follow


# we format our instructions 
from langchain.prompts import PromptTemplate
prompt = PromptTemplate.from_template(
"""
<s>[INST]
<<SYS>>{system}<</SYS>>
{instruction}
%CONTEXT
{context}
[/INST]
"""
)

# a retriever will extract documents from our vector db
retriever = db.as_retriever()

# and this function will combine our documents into a single piece of text 
def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])

# here's what that looks like when we ask a question 
pprint(
    format_docs(
        retriever.get_relevant_documents(
            'How many companies are building artificial intelligence solutions?',
        )
    )
)
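
To see the exact text the model receives, the same template can be filled in by hand with plain `str.format`. This is only an illustration: the template text is copied from the cell above, while the system, instruction and context strings are made up.

```python
# the template text from the cell above, as an ordinary Python string
LLAMA2_TEMPLATE = """<s>[INST]
<<SYS>>{system}<</SYS>>
{instruction}
%CONTEXT
{context}
[/INST]
"""

# made-up values for the three placeholders
filled = LLAMA2_TEMPLATE.format(
    system="You are a mining analyst.",
    instruction="Summarize the context in one sentence.",
    context="Drones now map the pit daily.",
)
print(filled)
```

The system text lands inside the `<<SYS>>` tags, and the retrieved news text follows the `%CONTEXT` marker, so the model can tell instructions apart from source material.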

In [215]:
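# now we can make a `chain` that 
# takes the source material, system instructions 
# and the specific instruction for the output 
# we are looking for and gives it all to the ai

from operator import itemgetter

rag_chain = (
    {
    # each prompt variable pulls its own key out of the input dict;
    # a bare RunnablePassthrough() would hand every variable the whole dict
    "context": lambda x: format_docs(x["context"]),  # join retrieved docs into plain text
    "system": itemgetter("system"),
    "instruction": itemgetter("instruction"),
    } 
    | prompt 
    | openai_llm #, or ollama_llm
    | StrOutputParser()
)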
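
One detail worth knowing about the first step of a chain like this: every entry in the dict is called with the full input. A passthrough therefore forwards the whole input dict to each prompt variable, while `operator.itemgetter` picks out a single key. Here is a plain-Python sketch of the difference (no langchain involved; the input values are made up).

```python
from operator import itemgetter

# made-up input, shaped like what gets passed to the chain
inputs = {
    "context": "Drones map the pit.",
    "system": "Be brief.",
    "instruction": "Summarize.",
}

def passthrough(x):
    """Stand-in for a passthrough step: returns its input unchanged."""
    return x

# every key receives the WHOLE input dict
routed_whole = {key: passthrough(inputs) for key in inputs}

# every key receives only its own value
routed_by_key = {key: itemgetter(key)(inputs) for key in inputs}

print(routed_whole["system"])
print(routed_by_key["system"])
```

The second routing is usually what a prompt template expects, since each placeholder should hold one piece of text.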


### (the fun part) Reason Over News

Note: if you're running on a machine without a GPU, this may be a tad slow. It's worth the wait, I promise. 

Alternatively, upload this file to your google drive and run it on [Google Colab](https://colab.research.google.com/), which offers a free GPU.

In [205]:
# let's retrieve the docs on a subject we wish to reason over

question = 'What kinds of artificial intelligence concepts and applications are being used in mining?'

context = retriever.get_relevant_documents(question)
formatted_context = format_docs(context)

#### Example 1: Summarize Content
First, let's try summarizing the content that we got so we have a general idea of this week's content.

In [212]:
# let's tell the ai how it should behave
system = """
`reset`
`no quotes`
`no explanations`
`no prompt`
`no self-reference`
`no apologies`
`no filler`
`just answer`
"""

# then we write our instructions 
instruction = """Using the context provided, create 
advanced bullet-point notes summarizing the important 
parts of the reading or topic. Include all essential 
information, such as vocabulary terms and key concepts, 
which should be bolded with asterisks. Remove any extraneous 
language, focusing only on the critical aspects of the 
passage or topic. Strictly base your notes on the provided 
text, without adding any external information."""


In [None]:
# then run it and see what we get 
answer = rag_chain.invoke({'context':context, 
                          'system':system,
                          'instruction':instruction})

pprint(answer)

#### Example 2: Evaluate the sentiment of a topic

In [217]:
system = """You're an expert social scientist and marketer 
whose job is to evaluate the sentiment of discussions on 
the topics given to you."""

instruction = """Organize the concepts described in the context 
provided and assign them a sentiment score from 0-5, with 5 being 
positive and 0 being negative. For each concept's score, include your
reasoning for why you assigned that score."""

In [218]:
answer = rag_chain.invoke({'context':context, 
                          'system':system,
                          'instruction':instruction})

pprint(answer)

('1. Advanced Sensor Technologies: Sentiment Score - 4\n'
 'Reasoning: The integration of advanced sensors in mining operations allows '
 'for real-time monitoring of various parameters, optimizing processes, '
 'improving safety, and enhancing resource efficiency. This technology is '
 'highly beneficial and contributes positively to the industry.\n'
 '\n'
 '2. Automation and Robotics: Sentiment Score - 5\n'
 'Reasoning: The use of autonomous vehicles, drones, and robotic systems in '
 'mining operations leads to increased efficiency and safety. Automation '
 'reduces the need for human presence in hazardous environments and enables '
 'more precise and controlled mining activities. This technology has a highly '
 'positive impact on the industry.\n'
 '\n'
 '3. Artificial Intelligence (AI) and Machine Learning: Sentiment Score - 5\n'
 'Reasoning: AI and machine learning technologies applied to data analysis, '
 'predictive modeling, and decision-making in mining operations optimize '


#### Example 3: Extractor
Pull out the names of entities that I'm looking for (e.g., companies, people) and count how many times they're mentioned.

In [221]:
system = """You're a professional web news researcher who specializes 
in extracting names of companies and people from web articles."""

instruction = """List the names of companies and people in the 
provided context, if any. For each entity, provide a count for how many 
times their name appears."""

In [None]:
answer = rag_chain.invoke({'context':context, 
                          'system':system,
                          'instruction':instruction})

pprint(answer)
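
Language models are often unreliable at counting, so it can be worth cross-checking the mention counts with ordinary string matching. Below is a minimal sketch: `count_mentions` is a hypothetical helper, and the company names and sample text are invented. In the notebook you would pass `formatted_context` and the names the model returned.

```python
import re
from collections import Counter

def count_mentions(text, names):
    """Count case-insensitive, whole-word occurrences of each name in text."""
    counts = Counter()
    for name in names:
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", flags=re.IGNORECASE)
        counts[name] = len(pattern.findall(text))
    return counts

# invented text and entity names, for illustration only
sample_text = (
    "Acme Mining signed a deal with Borealis AI. "
    "Acme Mining will pilot the Borealis AI sorter. "
    "acme mining shares rose."
)
counts = count_mentions(sample_text, ["Acme Mining", "Borealis AI"])
print(dict(counts))
```

Exact matching misses abbreviations and alternate spellings, so treat it as a sanity check on the model's numbers rather than a replacement for the extraction step.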

#### Example 4: Reverse Engineer
Take a technical concept currently in the news and generate an idea for our company that I can present to my boss.

In [223]:
system = """You're a Google Principal Engineer tasked 
with reverse engineering a concept described in the news."""

instruction = """Pick one concept described in the context 
provided and generate a software concept that will work 
on drones. Create an outline for the concept and produce 
some code templates that our software team can later start 
filling in."""

In [224]:
answer = rag_chain.invoke({'context':context, 
                          'system':system,
                          'instruction':instruction})

pprint(answer)

('Concept: AI-powered Drone Surveillance System for Mining Operations\n'
 '\n'
 'Outline:\n'
 '1. Introduction:\n'
 '   - Explain the need for advanced surveillance systems in mining '
 'operations.\n'
 '   - Discuss the benefits of using drones for surveillance.\n'
 '\n'
 '2. System Architecture:\n'
 '   - Describe the components of the AI-powered drone surveillance system.\n'
 '   - Explain how the system integrates with existing mining infrastructure.\n'
 '   - Discuss the communication and data transfer protocols used.\n'
 '\n'
 '3. Drone Control and Navigation:\n'
 "   - Provide code templates for controlling the drone's flight path.\n"
 '   - Explain how the drone can navigate through challenging terrains.\n'
 '   - Discuss the use of obstacle detection and avoidance algorithms.\n'
 '\n'
 '4. Real-time Monitoring and Data Collection:\n'
 '   - Provide code templates for capturing and transmitting live video '
 'feeds.\n'
 '   - Explain how the system can analyze the video feeds i

#### Example 5: Teacher
Feel more confident in your next meeting by having the AI help you reason over facts. 

In [225]:
system = """You are an elite ai researcher specializing 
in the mineral industry. Your job is to take high-level 
facts and provide me the low-level details to understand 
the minutiae of the concepts."""

instruction = """Generate a list of the concepts and facts 
presented in the context, and then provide an expert description
for each one. In the descriptions, also include analysis of 
ways people commonly misconstrue or confuse the facts."""

answer = rag_chain.invoke({'context':context, 
                          'system':system,
                          'instruction':instruction})

pprint(answer)


('Concepts and Facts:\n'
 '\n'
 '1. Advanced Sensor Technologies:\n'
 '   - Integration of advanced sensors, including IoT devices, for real-time '
 'monitoring in mining operations.\n'
 '   - Optimization of processes, improvement of safety, and enhancement of '
 'resource efficiency.\n'
 '   - Common misconception: Advanced sensors are only used for monitoring, '
 'but they also contribute to process optimization and safety improvement.\n'
 '\n'
 '2. Automation and Robotics:\n'
 '   - Use of autonomous vehicles, drones, and robotic systems in mining '
 'operations.\n'
 '   - Increased efficiency and safety by reducing the need for human presence '
 'in hazardous environments.\n'
 '   - Common misconception: Automation and robotics only contribute to '
 'efficiency, but they also enhance safety and enable more precise and '
 'controlled mining activities.\n'
 '\n'
 '3. Artificial Intelligence (AI) and Machine Learning:\n'
 '   - Application of AI and machine learning technologies in d

# Wow! Right?!

You're thinking: "I could do a lot more with this concept." And you're 100% right! Contact me if you're interested in exploring ideas: [drifft.hello@gmail.com](mailto:drifft.hello@gmail.com)

Also, there's a reason VCs are chasing new tech in the resources industries. I write about it all in my [weekly newsletter](https://drifft.beehiiv.com) - free.

I'm also on [linkedin](https://linkedin.com/in/davidimprovz)  |  [twitter](https://twitter.com/d_comfe)
