# Convert Photos to Text

The goal of this portion of the project is to develop a technique for converting sample screenshots to text. The notebook also lays the groundwork for the next phase, in which I will train an LSTM model on paired sequences: text extracted from photographs, and the same text cleaned up and properly formatted. This is the first step toward an architecture that takes a photo as input and produces clean, structured output.

## Set Up

In [1]:
## Necessary packages
import cv2
import pytesseract
import pandas as pd
import os
import re
import requests
from bs4 import BeautifulSoup
from utils import *
os.environ['TESSDATA_PREFIX'] = '/home/ec2-user/tesseract/tessdata'


## Convert Sample Screenshot to Text

In [2]:
## Load the sample screenshot
image = cv2.imread('Oceania_sunlight_hours.png')

## Convert the image to grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)


In [3]:
## Perform OCR on the grayscale image to convert it to a string
text = pytesseract.image_to_string(gray)

In [4]:
text

'Country City ¢ Jan ¢ Feb ¢ Mar ¢ Apr ¢ May ¢ (Jun ¢ |[Jul ¢ Aug ¢ Sep ¢ Oct ¢+ Nov ¢ Dec ¢ Year + Ref. ¢\n\nAustralia EZ:EM 285.2 257.1 2852 2940 300.7 2940 3162 3286 306.0 313.1 2940 291.4 3,5655 [187)\nAustralia Oodnadatta | 337.9 315.0 313.1 273.0 2449 231.0 2542 2759 291.0 3162 321.0 341.0 35142 [1%8\nAustralia Broome 257.3 2128 2635 2940 2914 2820 3069 3255 3120 3379 3360 291.4 35107 [189\nAustralia Alice Springs | 306.0 276.8 300.7 2850 263.5 2520 2821 3069 300.0 313.1 303.0 310.0 3,499.1 ?\nAustralia Perth 356.5 3149 2955 246.0 211.7 180.6 188.4 219.8 2324 299.8 320.4 359.4 3,229.5 ?\nAustralia Townsville 2542 211.9 2449 2439 2449 231.0 2635 279.0 291.0 3069 291.0 2883 3,141.1 [190\nAustralia Darwin 176.7 162.4 2108 261.0 297.6 297.0 313.1 319.3 297.0 291.4 2520 2139 3,0922 ?\nAustralia Brisbane 263.5 2232 2325 2340 2356 198.0 2387 266.6 270.0 2759 270.0 2654 2,968.4 ?\nAustralia Canberra 2045 2543 2511 219.0 186.0 156.0 179.8 217.0 231.0 266.6 267.0 291.4 2,813.7 ?\nAustralia 
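Comparing the OCR output above with the source table, many values appear to have lost their decimal point (for example `2128` where the table has `212.8`). One rough way to gauge how noisy a row is would be to count such tokens. The sketch below is only an illustration: the helper name `suspect_tokens` and the "four or more digits with no decimal point" heuristic are assumptions, not part of the notebook.

```python
import re

# One row copied verbatim from the OCR output above.
ocr_row = (
    "Australia Broome 257.3 2128 2635 2940 2914 2820 3069 3255 "
    "3120 3379 3360 291.4 35107 [189"
)

def suspect_tokens(row):
    """Return purely numeric tokens of 4+ digits (hypothetical heuristic).

    In this table every value has one decimal place, so a long run of
    digits with no '.' or ',' likely lost its punctuation during OCR.
    """
    return [tok for tok in row.split() if re.fullmatch(r"\d{4,}", tok)]

flagged = suspect_tokens(ocr_row)
print(len(flagged), flagged)
```

A count like this will not catch every OCR error (misread city names or reference brackets, for instance), but it gives a quick signal of how far a row is from the clean table.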

## Create Format for Training Set

In [5]:
## URL of the Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_cities_by_sunshine_duration#Oceania'

## Send a GET request to the URL and stop early on an HTTP error
response = requests.get(url, timeout=30)
response.raise_for_status()

## Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

## Extract the correct table from the Wikipedia page
## (match the section anchor by id on any tag, since the id is not always on a <span>)
oceania_table = soup.find(id='Oceania').find_next('table')

## Extract the rows from the HTML table
rows = oceania_table.find_all('tr')

In [8]:
## Use build_table() function from utils.py
df = build_table(rows)

In [9]:
df.to_string()

'             Country           City    Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec     Year   Ref.\n0          Australia  Tennant Creek  285.2  257.1  285.2  294.0  300.7  294.0  316.2  328.6  306.0  313.1  294.0  291.4  3,565.5  [187]\n1          Australia     Oodnadatta  337.9  315.0  313.1  273.0  244.9  231.0  254.2  275.9  291.0  316.2  321.0  341.0  3,514.2  [188]\n2          Australia         Broome  257.3  212.8  263.5  294.0  291.4  282.0  306.9  325.5  312.0  337.9  336.0  291.4  3,510.7  [189]\n3          Australia  Alice Springs  306.0  276.8  300.7  285.0  263.5  252.0  282.1  306.9  300.0  313.1  303.0  310.0  3,499.1      ?\n4          Australia          Perth  356.5  314.9  295.5  246.0  211.7  180.6  188.4  219.8  232.4  299.8  320.4  359.4  3,229.5      ?\n5          Australia     Townsville  254.2  211.9  244.9  243.9  244.9  231.0  263.5  279.0  291.0  306.9  291.0  288.3  3,141.1  [190]\n6          Australia         Darwin  176
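With both versions of the table available as strings, one possible training-set format is a list of (noisy, clean) line pairs. The sketch below shows the idea on two rows copied from the outputs above. It assumes the OCR rows and the DataFrame rows appear in the same order, and the helper names `normalize`, `strip_index`, and `make_pairs` are hypothetical, not functions from `utils.py`. The similarity ratio from `difflib` is included only as a rough check on how far each noisy line is from its target.

```python
import re
from difflib import SequenceMatcher

# Two rows copied from the OCR output and from df.to_string() above.
ocr_text = (
    "Australia Broome 257.3 2128 2635 2940 2914 2820 3069 3255 3120 3379 3360 291.4 35107 [189\n"
    "Australia Perth 356.5 3149 2955 246.0 211.7 180.6 188.4 219.8 2324 299.8 320.4 359.4 3,229.5 ?\n"
)
clean_text = (
    "2          Australia         Broome  257.3  212.8  263.5  294.0  291.4  282.0  306.9  325.5  312.0  337.9  336.0  291.4  3,510.7  [189]\n"
    "4          Australia          Perth  356.5  314.9  295.5  246.0  211.7  180.6  188.4  219.8  232.4  299.8  320.4  359.4  3,229.5      ?\n"
)

def normalize(line):
    """Collapse runs of whitespace so column padding is not treated as noise."""
    return re.sub(r"\s+", " ", line).strip()

def strip_index(line):
    """Drop the leading DataFrame row index that to_string() prints."""
    return re.sub(r"^\d+\s+", "", line)

def make_pairs(ocr_text, clean_text):
    """Pair non-empty OCR lines with clean lines, assuming identical row order."""
    noisy = [normalize(l) for l in ocr_text.splitlines() if l.strip()]
    clean = [strip_index(normalize(l)) for l in clean_text.splitlines() if l.strip()]
    return list(zip(noisy, clean))

pairs = make_pairs(ocr_text, clean_text)
for noisy, clean in pairs:
    ratio = SequenceMatcher(None, noisy, clean).ratio()
    print(f"{ratio:.3f} | {noisy} -> {clean}")
```

Because `zip` stops at the shorter list, a missing or extra OCR line would silently shift the alignment, so a real pipeline would probably want to verify the row counts match first.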

## Build Functions for Generating a Training Set

In [10]:
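def img_to_txt(image_file_name):
    """Run OCR on an image file and return the extracted text."""
    import cv2
    import pytesseract

    ## Load the image; cv2.imread returns None if the file cannot be read
    image = cv2.imread(image_file_name)
    if image is None:
        raise FileNotFoundError(f'Could not read image: {image_file_name}')

    ## Convert the image to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    ## Perform OCR on the grayscale image to convert it to a string
    text = pytesseract.image_to_string(gray)

    return text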
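
To generate a training set from more than one screenshot, a converter such as `img_to_txt` could be applied to every image in a folder. The following is a minimal sketch under stated assumptions: the helper name `ocr_directory`, the `*.png` default pattern, and the idea of passing the converter in as an argument are all illustrative choices, not part of the notebook.

```python
from pathlib import Path

def ocr_directory(image_dir, convert, pattern="*.png"):
    """Apply `convert` to each matching file and return {file name: text}.

    `convert` is any callable that takes a file path string and returns
    text, for example the img_to_txt function defined above.
    """
    results = {}
    # Sort the paths so the output order is stable across runs
    for path in sorted(Path(image_dir).glob(pattern)):
        results[path.name] = convert(str(path))
    return results
```

A call might look like `texts = ocr_directory('screenshots', img_to_txt)`, where `screenshots` is a hypothetical folder name. Passing the converter as a parameter keeps the helper free of OCR dependencies, so it can be tried out with a stand-in function before running Tesseract.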