### From Data to Reports: AI-Powered Analysis and PDF Generation

### Introduction

This project demonstrates an automated analysis of monthly sales data using **SmolLM3-3B**, a compact open language model from Hugging Face. The goal is to show how a language model can help summarize sales trends and produce a readable report from raw sales figures.

The dataset contains monthly sales from March to July, measured in thousands of Turkish Lira (k TRY). The following sections present the raw data, the AI-generated summary, and the final sales report exported as a PDF for easy distribution.



---

### Step 1: Required libraries

```python
!pip install -q transformers weasyprint matplotlib pandas
```

* Installs the necessary Python libraries.
* `transformers`: For NLP models by Hugging Face.
* `weasyprint`: Converts HTML to PDF.
* `matplotlib`: For plotting (not used in this code).
* `pandas`: For data manipulation and tables.
* The `-q` flag suppresses unnecessary output during installation.


---

### Step 2: Imports

```python
from transformers import pipeline, AutoTokenizer
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from zoneinfo import ZoneInfo
import base64
from weasyprint import HTML
import re
```

* Imports required modules.
* `pipeline` and `AutoTokenizer`: For easy use of the model.
* `pandas`: DataFrame operations.
* `matplotlib.pyplot`: Plotting (unused here).
* `datetime` and `ZoneInfo`: Date and time handling (Istanbul timezone).
* `base64`: For data encoding (unused here).
* `HTML` (from WeasyPrint): To convert HTML to PDF.
* `re`: Regular expressions for text processing.


---

### Step 3: Data

```python
df = pd.DataFrame({
    "Month": ["March", "April", "May", "June", "July"],
    "Sales (k TRY)": [100, 150, 200, 250, 300]
})
```

* Creates a DataFrame with sales data.
* Months from March to July, sales in thousand Turkish Lira.
* This data will be used for analysis.
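Before the table is handed to a model, it can help to compute the trend directly, so the AI summary can later be checked against known numbers. The following is a minimal sketch using pandas `diff` and `pct_change`; the names `sales`, `trend`, `Change (k TRY)`, and `Growth (%)` are illustrative and not part of the original notebook.

```python
import pandas as pd

sales = pd.DataFrame({
    "Month": ["March", "April", "May", "June", "July"],
    "Sales (k TRY)": [100, 150, 200, 250, 300],
})

# Work on a copy so the frame used for the prompt stays unchanged.
trend = sales.copy()

# Absolute month-over-month change (the first month has no predecessor, so NaN).
trend["Change (k TRY)"] = trend["Sales (k TRY)"].diff()

# Relative month-over-month growth in percent, rounded to one decimal.
trend["Growth (%)"] = (trend["Sales (k TRY)"].pct_change() * 100).round(1)

# Growth from the first to the last month, in percent.
first, last = trend["Sales (k TRY)"].iloc[0], trend["Sales (k TRY)"].iloc[-1]
total_growth = (last / first - 1) * 100

print(trend.to_string(index=False))
print(f"Total growth March to July: {total_growth:.0f}%")
```

Sales rise by a constant 50k TRY each month, so the relative growth rate shrinks over time even though the absolute increase stays the same. A good AI summary should be consistent with both facts.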




---

### Step 4: Load model

```python
model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model_id, tokenizer=tokenizer, device=0)
```

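* Loads the `HuggingFaceTB/SmolLM3-3B` model from the Hugging Face Hub (about 6 GB of weights on first download).
* `AutoTokenizer` loads the matching tokenizer, which converts text into model inputs.
* `pipeline` sets up the text-generation task.
* `device=0` requests the first GPU (`cuda:0`). Behavior on a machine without a GPU varies across `transformers` versions, so choose the device explicitly for CPU-only runs.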
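To make the device choice explicit, the GPU check can be done before the pipeline is built. This is a minimal sketch, not part of the original notebook: `pick_device` is a hypothetical helper, and it relies on the documented `pipeline` convention that `device=-1` means CPU.

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) if one is usable, otherwise -1 (CPU)."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("Using GPU 0" if device == 0 else "Using CPU")

# In the notebook, the result would be passed on like this
# (left commented out here because it downloads the model weights):
# pipe = pipeline("text-generation", model=model_id, tokenizer=tokenizer, device=device)
```

Running a 3B-parameter model on CPU works but is slow, so the fallback is mainly useful for testing the rest of the notebook.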




---

### Step 5: Prompt the model with the data

```python
df_prompt = df.to_string(index=False)
prompt = f"Analyze the sales data below and provide a concise summary of trends and insights:\n\n{df_prompt}\n"
messages = [{"role": "user", "content": prompt}]
```

* Converts DataFrame to string for the prompt.
* Asks the model to analyze and summarize sales trends.
* Prepares the input message with role `user`.
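The exact text the model sees depends on how the table is serialized. As an alternative to `to_string`, the sketch below serializes the table as CSV, which avoids alignment padding and is easy to inspect. It is an illustration only: `build_messages` and `sales` are hypothetical names, and the prompt wording is copied from the step above.

```python
import pandas as pd


def build_messages(df: pd.DataFrame) -> list:
    """Serialize the table as CSV and wrap it in a single user message."""
    table_text = df.to_csv(index=False).strip()
    prompt = (
        "Analyze the sales data below and provide a concise summary "
        "of trends and insights:\n\n"
        f"{table_text}\n"
    )
    return [{"role": "user", "content": prompt}]


sales = pd.DataFrame({
    "Month": ["March", "April", "May", "June", "July"],
    "Sales (k TRY)": [100, 150, 200, 250, 300],
})

messages = build_messages(sales)
print(messages[0]["content"])
```

Printing the prompt before sending it is a cheap way to confirm that column names and units, here `k TRY`, actually reach the model.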



---

### Step 6: Get first part of the response

```python
result = pipe(messages, max_new_tokens=350)
first_part = result[0]['generated_text'][-1]['content']

# Optional: Remove prompt from beginning if present
if first_part.startswith(prompt):
    first_part = first_part[len(prompt):].strip()
```

* Sends the chat messages to the model and generates up to 350 new tokens.
* Because the input is a list of chat messages, `generated_text` holds the whole conversation as a list of message dicts, so `[-1]['content']` selects the assistant's reply.
* This reply is the first part of the summary; it may stop mid-sentence if the token limit is reached.
* As a precaution, strips the prompt from the start of the reply if the model echoed it.
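The indexing is easier to follow with a concrete value. The sketch below uses a mocked result with the same shape as a chat-style text-generation output; the reply text is invented for illustration, no model is called, and `last_assistant_reply` is a hypothetical helper.

```python
# Mocked pipeline output: a list with one item per input conversation.
# "generated_text" contains the full chat, ending with the new assistant message.
mock_result = [
    {
        "generated_text": [
            {"role": "user", "content": "Analyze the sales data below ..."},
            {"role": "assistant", "content": "Sales rose by 50k TRY every month."},
        ]
    }
]


def last_assistant_reply(result) -> str:
    """Return the content of the final message of the first conversation."""
    conversation = result[0]["generated_text"]
    last_message = conversation[-1]
    if last_message["role"] != "assistant":
        raise ValueError("The last message is not an assistant reply.")
    return last_message["content"]


print(last_assistant_reply(mock_result))
```

The role check is an extra safeguard: it fails loudly if the output shape ever differs from what the indexing assumes.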

---

### Step 7: Continue if the text ends abruptly

```python
continuation_prompt = [{"role": "user", "content": f"Please continue this analysis: {first_part[-300:]}"}]
continued = pipe(continuation_prompt, max_new_tokens=150)
second_part = continued[0]['generated_text'][-1]['content']
```

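* Asks the model to continue the analysis, passing the last 300 characters of the first part as context.
* Generates up to 150 more tokens for the continuation.
* Note that the code runs this step unconditionally; it does not check whether the first part was actually cut off.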
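If the continuation should only run when the first part really ends abruptly, a simple punctuation check can guard it. This is a sketch under stated assumptions: `looks_truncated` is a heuristic, and `generate_more` is a hypothetical callable standing in for the model call, so the example runs without loading SmolLM3-3B.

```python
import re


def looks_truncated(text: str) -> bool:
    """Heuristic: the text is cut off if it does not end in '.', '!' or '?'."""
    return re.search(r"[.!?]\s*$", text) is None


def complete_reply(first_part: str, generate_more) -> str:
    """Append a continuation only when first_part looks truncated.

    generate_more takes the tail of the text and returns the
    continuation as a string.
    """
    if not looks_truncated(first_part):
        return first_part.strip()
    second_part = generate_more(first_part[-300:])
    return (first_part + " " + second_part).strip()


def fake_generate(tail: str) -> str:
    """Stand-in for the model call; returns a fixed continuation."""
    return "steadily through July."


print(complete_reply("Sales grew every month.", fake_generate))
print(complete_reply("Sales grew", fake_generate))
```

In the notebook, `generate_more` could wrap the existing `pipe(...)` call with the "Please continue this analysis" prompt and return `continued[0]['generated_text'][-1]['content']`.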

---

### Optional: Remove repetition in continuation

```python
if second_part.startswith("Please continue this analysis:"):
    second_part = second_part[len("Please continue this analysis:"):].strip()
```

* Cleans up repeated prompt text at the start of the continuation response.

---

### Combine parts

```python
ai_summary_raw = (first_part + " " + second_part).strip()
```

* Joins the first and second parts of the response.


---

### Step 8: Clean final output

```python
def fix_incomplete_sentence(text):
    sentences = re.split(r'(?<=[.!?]) +', text.strip())
    if not sentences:
        return text.strip()
    last_sentence = sentences[-1]
    if not re.search(r'[.!?]$', last_sentence):
        sentences = sentences[:-1]
    clean_text = ' '.join(sentences).strip()
    if clean_text and not clean_text.endswith(('.', '!', '?')):
        clean_text += " In conclusion, the data demonstrates a consistent upward trend in sales."
    return clean_text

ai_summary = fix_incomplete_sentence(ai_summary_raw)
```

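* Splits the text into sentences at spaces that follow `.`, `!`, or `?`.
* Drops the last sentence if it does not end in terminal punctuation.
* Appends a fixed concluding sentence if the remaining text is non-empty and still lacks terminal punctuation. Because every sentence kept after the split already ends in punctuation, this branch acts only as a safeguard.
* Returns the cleaned summary, which is an empty string if the input contained no complete sentence.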
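Two edge cases are worth illustrating: sentences separated by newlines rather than spaces, and text with no complete sentence at all. The variant below is an alternative sketch, not the notebook's function; `trim_to_last_sentence` is a hypothetical name. It cuts the text after the last terminator that is followed by whitespace or the end of the text, and returns the input unchanged when no terminator exists.

```python
import re


def trim_to_last_sentence(text: str) -> str:
    """Cut text after its last '.', '!' or '?' that ends a sentence.

    A terminator counts only if whitespace or the end of the text follows,
    so decimals such as 3.5 are not treated as sentence ends.
    If there is no terminator, the stripped text is returned unchanged.
    """
    text = text.strip()
    matches = list(re.finditer(r"[.!?](?=\s|$)", text))
    if not matches:
        return text
    return text[: matches[-1].end()]


print(trim_to_last_sentence("Sales rose steadily. Growth was 3.5 percent in"))
print(trim_to_last_sentence("Sales rose steadily.\nJuly was the best month. The next"))
print(trim_to_last_sentence("no complete sentence here"))
```

Returning the original text in the last case is a design choice: an unpolished summary in the report is usually preferable to an empty one.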

---

### Step 9: Prepare date and filename

```python
now = datetime.now(ZoneInfo("Europe/Istanbul"))
formatted_date = now.strftime("%A, %d %B %Y, %H:%M")
timestamp = now.strftime("%Y%m%d_%H%M%S")
filename = f"ai_sales_report_{timestamp}.pdf"
```

* Gets current date and time in Istanbul timezone.
* Formats date in human-readable form.
* Creates a timestamp string for a unique filename.
* Prepares the PDF filename.
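To see what the two format strings produce, the sketch below applies them to a fixed moment instead of `datetime.now()`, so the result is reproducible. The fallback to a fixed UTC+3 offset is an addition for systems without time zone data and is not in the original notebook.

```python
from datetime import datetime, timedelta, timezone

try:
    from zoneinfo import ZoneInfo
    istanbul = ZoneInfo("Europe/Istanbul")
except Exception:
    # Fallback when no time zone database is installed.
    # Turkey has used UTC+3 all year since 2016.
    istanbul = timezone(timedelta(hours=3))

# A fixed example moment: 18 July 2025, 15:00:46 Istanbul time.
moment = datetime(2025, 7, 18, 15, 0, 46, tzinfo=istanbul)

formatted_date = moment.strftime("%A, %d %B %Y, %H:%M")
timestamp = moment.strftime("%Y%m%d_%H%M%S")
filename = f"ai_sales_report_{timestamp}.pdf"

print(formatted_date)  # Friday, 18 July 2025, 15:00 (with the default C locale)
print(filename)        # ai_sales_report_20250718_150046.pdf
```

Because the timestamp includes seconds, two reports generated in the same minute still get different filenames. Day and month names from `%A` and `%B` follow the active locale.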



---

### Step 10: Convert table to HTML

```python
table_html = df.to_html(index=False, border=0, classes="dataframe")
table_style = """
<style>
  .dataframe {
    border-collapse: collapse;
    width: 100%;
    margin-top: 1cm;
  }
  .dataframe th, .dataframe td {
    border: 1px solid #666;
    padding: 8px;
    text-align: center;
    font-size: 1em;
  }
  .dataframe th {
    background-color: #f2f2f2;
  }
</style>
"""
```

* Converts DataFrame to an HTML table.
* Adds CSS styles for borders, padding, and fonts.




---

### Step 11: Full HTML content

```python
html = f"""
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
{table_style}
<style>
  body {{ font-family: DejaVu Sans, sans-serif; margin: 2cm; }}
  h1 {{ text-align: center; }}
  .date {{ text-align: right; font-size: 0.8em; margin-bottom: 0.5cm; }}
  .author {{ text-align: right; font-size: 0.8em; margin-bottom: 0.5cm; }}
  p {{ font-size: 1.1em; line-height: 1.5; }}
</style>
</head>
<body>

<h1>Monthly Sales Report</h1>
<div class="date">Date: {formatted_date}</div>
<div class="author">Generated by: AI Analysis Unit (SmolLM3-3B)</div>

<h2>Sales Table</h2>
{table_html}

<h2>AI Review</h2>
<p>{ai_summary}</p>

</body>
</html>
"""
```

* Creates the full HTML page for the report.
* Includes styles, header, date, author info, sales table, and AI-generated summary.
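The template inserts the model output into the page as-is, so characters such as `<` or `&` in the summary would be interpreted as markup, and paragraph breaks would be lost. A minimal sketch of one way to handle this with the standard library follows; `summary_to_html` is a hypothetical helper, not part of the original notebook.

```python
from html import escape


def summary_to_html(text: str) -> str:
    """Escape model output and wrap each blank-line-separated block in <p> tags."""
    blocks = [block.strip() for block in text.split("\n\n") if block.strip()]
    paragraphs = []
    for block in blocks:
        # Escape first, then turn remaining single newlines into line breaks.
        safe = escape(block).replace("\n", "<br>")
        paragraphs.append(f"<p>{safe}</p>")
    return "\n".join(paragraphs)


sample = "Sales grew <50%> per month & stayed positive.\n\nJuly was the peak."
print(summary_to_html(sample))
```

In the template, `{summary_to_html(ai_summary)}` would then take the place of `<p>{ai_summary}</p>`. The sketch imports `escape` by name because the notebook already uses `html` as a variable for the page string.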


---

### Step 12: Export to PDF

```python
HTML(string=html).write_pdf(filename)
print(f"✅ PDF generated: {filename}")
```

* Converts the HTML string into a PDF file using WeasyPrint.
* Saves it with the generated filename.
* Prints a confirmation message.
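In some environments, including the Colab session this notebook was run in, `write_pdf` prints many `DEBUG` and `INFO` lines from `fontTools`, the library used for font subsetting. A minimal sketch for quieting them with the standard `logging` module is shown below; it assumes these libraries log under the `fontTools` and `weasyprint` logger names and have not set their own levels on child loggers.

```python
import logging

# Raise the threshold for the font-handling and PDF libraries so that
# only warnings and errors are shown while the PDF is written.
for name in ("fontTools", "weasyprint"):
    logging.getLogger(name).setLevel(logging.WARNING)

# Child loggers such as "fontTools.subset" inherit the level of their parent.
print(logging.getLogger("fontTools.subset").getEffectiveLevel() == logging.WARNING)
```

Run this before the `write_pdf` call; the PDF itself is unaffected, only the console output changes.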

Sample output:

`✅ PDF generated: ai_sales_report_20250718_150046.pdf`


### Conclusion

This notebook shows how a small language model can be combined with pandas and WeasyPrint to automate a basic business report: the model summarizes the sales figures, and the result is exported as a formatted PDF. This can save analysts and decision-makers time on routine reporting, although the generated text should still be checked against the data.

Future work could use richer datasets, add visualizations, and refine the prompts for deeper analysis. Overall, this example illustrates how AI can support data-driven decision-making in a practical setting.
