# Data Processing

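Data is the lifeblood of modern businesses and organizations, but it arrives in many different file formats. This notebook shows how to extract text or structured data from each of the formats listed below using Python.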

## Table of Contents
   1. [Text](#text)
   2. [DOCX](#docx)
   3. [PDF](#pdf)
   4. [Excel](#excel)
   5. [PPT](#ppt)
   6. [Image](#image) 
   7. [JSON](#json)

In [44]:
# Import modules

import os
from docx import Document # For DOCX processing

### Text
Plain text is the most basic format for text data: a simple file with no formatting or metadata. It is widely used as input for NLP tasks.

In [43]:
# Read TXT file

# Resolve the folder two levels above the current working directory
current_directory = os.getcwd()
parent_folder = os.path.abspath(os.path.join(current_directory, '..', '..'))
txt_path = os.path.join(parent_folder, 'data', 'model_audit.txt')

with open(txt_path, 'r', encoding='utf-8') as file:
    data_txt = file.read()

print(data_txt[:300])

Best Practices for Auditing a Model
When modeling, I encourage you to always bear this single question at the back of your mind: “Am I making this model easily auditable?” because for every task executed, formula created, and link built, there will always be a faster, “dirtier” (in industry parlance
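
The cell above assumes the file is UTF-8 encoded. As a minimal sketch for files of unknown encoding, the hypothetical helper `read_text_with_fallback` tries a list of encodings in order. The helper name and the choice of `cp1252` as the fallback are assumptions for illustration and are not part of the original notebook.

```python
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return the file's text, trying each encoding in order."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # Try the next candidate encoding
    # Last resort: decode with the first encoding and replace bad bytes
    return raw.decode(encodings[0], errors="replace")
```

It could be used in place of the `open(...)` call, for example `data_txt = read_text_with_fallback(txt_path)`.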


### DOCX
Word processing format used for document creation. It was developed by Microsoft and standardized as Office Open XML (ECMA-376, ISO/IEC 29500). A DOCX file is a ZIP archive of XML parts, so it is not human-readable without unpacking it or using a library such as python-docx.

In [50]:
# Read DOCX file

docx_path = parent_folder + '/data/ai-report.docx'

doc = Document(docx_path)
data_docx = '\n'.join([paragraph.text for paragraph in doc.paragraphs])

print(data_docx[:500])





Artificial Intelligence and the Future of Teaching and Learning
Insights and Recommendations
May 2023

Artificial Intelligence and the Future of Teaching and Learning
Miguel A. Cardona, Ed.D.
Secretary, U.S. Department of Education
Roberto J. Rodríguez
Assistant Secretary, Office of Planning, Evaluation, and Policy Development
Kristina Ishmael
Deputy Director, Office of Educational Technology May 2023
Examples Are Not Endorsements
This document contains examples and resource materials that a
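
A `.docx` file is a ZIP archive whose main body is stored as XML in `word/document.xml`. As a rough sketch, the body text can also be extracted with only the standard library. The function name `docx_paragraphs` is hypothetical, and this approach ignores headers, footers, and footnotes, which are stored in other parts of the archive. Unlike `doc.paragraphs`, it also returns paragraphs that sit inside tables.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements such as <w:p> and <w:t>
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the text of every paragraph in the document body."""
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph is split into runs; join the text of all runs
        text = "".join(node.text or "" for node in paragraph.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

For example, `'\n'.join(docx_paragraphs(docx_path))` would give a string comparable to `data_docx`.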


### PDF
Portable Document Format, a standard for document sharing. It was created by Adobe and has been an open ISO standard (ISO 32000) since 2008. PDF files mix text with binary streams, so extracting their text requires a parser such as pdfminer.

In [3]:
from pdfminer.high_level import extract_text

def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)

pdf_path = './data/setting-up-and-sustaining-ai-success.pdf'
data_pdf = extract_text_from_pdf(pdf_path)
print(data_pdf[:500])

AI Strategy
Setting up and sustaining AI success in uncertain times

AI Strategy

Executive Summary
Rethink AI strategy for resilience and 
business results in uncertain times

AI strategy is the ability to set and communicate a 
vision, define a roadmap, and build a business case 
for how artificial intelligence will be used over time 
at an organizational level. This includes the levels of 
leadership engagement required to create sustainable 
AI within the organization and ROI for the busine
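
As the output above shows, the extracted text keeps the line breaks of the page layout. A minimal sketch for rejoining wrapped lines into paragraphs follows. `unwrap_lines` is a hypothetical helper, and it assumes that paragraphs are separated by blank lines. That assumption holds only loosely for real PDFs, so a heading that directly precedes a paragraph will be merged into it.

```python
import re


def unwrap_lines(text):
    """Join hard-wrapped lines, keeping blank lines as paragraph breaks."""
    # Split on one or more blank lines (possibly containing spaces)
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse all whitespace inside a paragraph to single spaces
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

It could be applied directly to the extracted text, for example `print(unwrap_lines(data_pdf)[:500])`.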



### Excel
Spreadsheet format for tabular data and calculations.

In [4]:
import pandas as pd

data = pd.read_excel("./data/weather_data_2022.xlsx")
print(data.head())

        Date  Temperature (C)  Humidity (%)  Precipitation (mm)   
0 2022-01-01        29.565708            75            8.408121  \
1 2022-01-02        27.412771            75            4.755401   
2 2022-01-03        -9.296075            47            5.929933   
3 2022-01-04        10.167161            38            7.858446   
4 2022-01-05        -7.061943            40            5.544814   

   Wind Speed (km/h)  
0                 13  
1                 19  
2                  7  
3                  9  
4                 11  
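
Unlike the other sections, `pd.read_excel` returns a DataFrame rather than a string. As a minimal sketch, assuming the goal is to hand the table to a text-based step, the hypothetical helper `frame_to_text` serializes the first rows as CSV. The small `sample` frame reuses values from the output above so that the block runs without the Excel file.

```python
import pandas as pd


def frame_to_text(df, max_rows=5):
    """Serialize the first rows of a DataFrame as CSV text."""
    return df.head(max_rows).to_csv(index=False)


# Stand-in for the DataFrame loaded from the spreadsheet
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-01", "2022-01-02"]),
    "Temperature (C)": [29.565708, 27.412771],
    "Humidity (%)": [75, 75],
})
data_excel_text = frame_to_text(sample)
print(data_excel_text)
```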


### PPT
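Presentation format for slideshows. The code below reads `.pptx` files with python-pptx, which does not support the legacy binary `.ppt` format.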

In [5]:
from pptx import Presentation

def extract_text_from_ppt(ppt_path):
    prs = Presentation(ppt_path)
    text = ""
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                text += shape.text + "\n"
    return text

ppt_path = './data/capabilityAzureOpenAI.pptx'
data_ppt = extract_text_from_ppt(ppt_path)
print(data_ppt)


Capability overview of Azure OpenAI
Completion
Capability overview of Azure OpenAI
Embeddings



### Image
Image formats such as PNG and JPEG store visual data. To extract text from an image with OCR, this section uses Tesseract and its Python wrapper pytesseract, so you need to install both the Tesseract OCR software and the pytesseract Python package. The Pillow package is also required for loading images.

1. Tesseract OCR Software Installation:
    - Ubuntu/Debian:
        ```bash
        sudo apt-get install tesseract-ocr
        ```
    - Windows:
        Download the installer from [GitHub](https://github.com/UB-Mannheim/tesseract/wiki) and run it. After installation, ensure the Tesseract install directory is in your system's PATH, or provide the path to the executable explicitly in your Python code.

    - macOS:
        ```bash
        brew install tesseract
        ```
2. Python Packages Installation:
    You can install the required Python packages using pip:
    ```bash
    pip install pytesseract Pillow
    ```
After these installations, you should be set up to perform OCR on images.
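
Before running OCR, it can help to confirm that the Tesseract executable is actually reachable. The following is a minimal sketch using only the standard library. The helper name `find_tesseract` is hypothetical, and the commented-out last line shows how the path could be passed to pytesseract through its `tesseract_cmd` setting.

```python
import shutil


def find_tesseract(explicit_path=None):
    """Return the path to the Tesseract executable, or None if not found."""
    if explicit_path:
        # shutil.which also accepts a full path and checks it is executable
        return shutil.which(explicit_path)
    return shutil.which("tesseract")


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found on PATH; install it or pass its full path.")
else:
    print(f"Using Tesseract at {tesseract_path}")
    # With pytesseract imported, the executable can be set explicitly:
    # pytesseract.pytesseract.tesseract_cmd = tesseract_path
```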

In [6]:
from PIL import Image
import pytesseract

def ocr_from_image(image_path):
    image = Image.open(image_path)
    text = pytesseract.image_to_string(image)
    return text

image_path = './data/ai-adoption.png'
extracted_text = ocr_from_image(image_path)
print(extracted_text)

THE FIVE COMMANDMENTS OF ENTERPRISE Al ADOPTION

CONSTRUCT DRIVE A CHANGE STRUCTURES FOCUS ON
A ROBUST DATA-CENTRIC AND CULTURE TO BUSINESS AND SCALE AT SPEED
Al STRATEGY ENTERPRISE IMPROVE ADOPTION HUMAN CENTRICITY

 _— —— —— —_—__
Build an Al strategy Assess data Make necessary Contextualize Al Leverage CoEs
that supports your complexities operating model projects with and the partner
business strategy and define a changes to drive organizational goals ecosystem to
andlinvaescs path to becoming enterprise-wide to evaluate ROI and jumpstart and
existing IT a data-centric adoption business impact ease adoption
capabilities enter se.

*
Source: Everest Group | Infographic design by Antonio Grasso hettaloghy deltalogix.blog



In [7]:
# Initial setup for Azure OpenAI API (uses the legacy openai<1.0 SDK interface)
import os
import openai
from dotenv import load_dotenv

# Load environment variables from .env file
dotenv_path = os.path.join(os.path.dirname(os.getcwd()), '.env')  # Assumes .env is in the parent directory of your notebook
load_dotenv(dotenv_path)

# Access environment variables
AZURE_OPENAI_API_KEY = os.environ.get('AZURE_OPENAI_KEY')
AZURE_OPENAI_ENDPOINT = os.environ.get('AZURE_OPENAI_ENDPOINT')
AZURE_OPENAI_VERSION = "2022-12-01"

# Set OpenAI API configuration
openai.api_type = "azure"
openai.api_key = AZURE_OPENAI_API_KEY
openai.api_base = AZURE_OPENAI_ENDPOINT
openai.api_version = AZURE_OPENAI_VERSION # this may change in the future

# Name of the text-davinci-003 deployment in the Azure OpenAI resource
deployment_name = "text-davinci-003"

prompt_image = "Explain in detail with proper formatting and summarize the extracted text from OCR on the image.\n\nExtracted Text:\n" + extracted_text + "\n\nSummary:"
response = openai.Completion.create(
        engine = deployment_name,
        prompt = prompt_image,
        temperature = 0.2,
        max_tokens = 800
    )

print(response.choices[0].text)


This infographic provides five commandments for successful enterprise Al adoption. These include building an Al strategy that supports the business strategy, assessing data complexities, making necessary operating model changes, contextualizing Al projects with organizational goals, and leveraging CoEs and the partner ecosystem to evaluate ROI and ease adoption. These steps will help organizations become data-centric and improve their Al adoption.
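
The prompt and the completion share the model's context window, so very long OCR output can crowd out the summary. As a minimal sketch, the hypothetical helper `build_summary_prompt` builds the same prompt as the cell above but truncates the extracted text first. The `max_chars` limit of 4000 is an arbitrary assumption for illustration, not a value taken from the API.

```python
def build_summary_prompt(extracted_text, max_chars=4000):
    """Build the summarization prompt, truncating overly long OCR text."""
    text = extracted_text.strip()
    if len(text) > max_chars:
        # Cut at the character limit and mark the cut for the model
        text = text[:max_chars].rstrip() + " [truncated]"
    return (
        "Explain in detail with proper formatting and summarize "
        "the extracted text from OCR on the image.\n\n"
        f"Extracted Text:\n{text}\n\nSummary:"
    )
```

The result could be passed as `prompt=build_summary_prompt(extracted_text)` in the completion call.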


### JSON
JavaScript Object Notation, a standard for data interchange. It is a text-based format and is human-readable.
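
Because JSON is plain text, it maps directly onto Python's built-in types. The short sketch below parses an inline JSON string (a one-record sample in the shape of the movie data used in this section) and checks which Python type each value becomes.

```python
import json

# A JSON array containing one object, written as plain text
raw = '[{"Title": "Inception", "Year": 2010, "Rating": 8.8}]'

movies = json.loads(raw)  # JSON array -> list, JSON object -> dict
first = movies[0]

# JSON strings become str, whole numbers int, decimals float
value_types = {key: type(value).__name__ for key, value in first.items()}
print(value_types)

# Serializing and parsing again yields an equal Python structure
round_tripped = json.loads(json.dumps(movies))
print(round_tripped == movies)
```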

In [8]:
import json

def extract_data_from_json(json_path):
    with open(json_path, 'r', encoding='utf-8') as json_file:
        data = json.load(json_file)
    return data

json_path = './data/movies.json'
data = extract_data_from_json(json_path)

# Pretty print the extracted JSON data
data_json = json.dumps(data, indent=4)
print(data_json)


[
    {
        "Title": "Inception",
        "Year": 2010,
        "Genre": "Science Fiction",
        "Director": "Christopher Nolan",
        "Rating": 8.8
    },
    {
        "Title": "The Shawshank Redemption",
        "Year": 1994,
        "Genre": "Drama",
        "Director": "Frank Darabont",
        "Rating": 9.3
    },
    {
        "Title": "The Dark Knight",
        "Year": 2008,
        "Genre": "Action",
        "Director": "Christopher Nolan",
        "Rating": 9.0
    },
    {
        "Title": "Pulp Fiction",
        "Year": 1994,
        "Genre": "Crime",
        "Director": "Quentin Tarantino",
        "Rating": 8.9
    },
    {
        "Title": "The Godfather",
        "Year": 1972,
        "Genre": "Crime",
        "Director": "Francis Ford Coppola",
        "Rating": 9.2
    },
    {
        "Title": "Forrest Gump",
        "Year": 1994,
        "Genre": "Drama",
        "Director": "Robert Zemeckis",
        "Rating": 8.8
    },
    {
        "Title": "The Matrix

## Conclusion and Further Reading