### AWS
https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/automatically-extract-content-from-pdf-files-using-amazon-textract.html

https://docs.aws.amazon.com/AmazonS3/latest/userguide/IndexDocumentSupport.html

https://www.gormanalysis.com/blog/connecting-to-aws-s3-with-python/
https://realpython.com/python-boto3-aws-s3/


#### SageMaker

https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html

https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-mlconcepts.html

https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html

https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html

Starting a notebook instance

https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html

Git repo
https://github.com/aws/amazon-sagemaker-examples

## Use Cases

## 1. Text extraction with Textract and multi-label text classification
### Problem Statement 

#### Extract text from PDF data using S3, SageMaker and Textract

#### Task

1) Split a large PDF stored in S3 into pages and store the selected pages back in S3
2) Use Textract to extract text from the PDF pages
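
Task 2 can be sketched as follows. This is a minimal sketch, assuming each split PDF is a single page already stored in S3: to my understanding the synchronous `detect_document_text` call only handles single-page documents, and multi-page PDFs need the asynchronous `start_document_text_detection` API instead. The helper names `extract_lines` and `textract_single_page`, and the bucket and key in the example call, are hypothetical.

```python
def extract_lines(response):
    """Return the text of every LINE block in a Textract response, in order."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def textract_single_page(textract_client, bucket, key):
    """Run synchronous text detection on one single-page document in S3."""
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return "\n".join(extract_lines(response))


# Example call (needs AWS credentials and Textract permissions;
# the bucket and key are placeholders):
#   import boto3
#   text = textract_single_page(boto3.client("textract"), "my_bucket", "page_3.pdf")
```

Passing the client in as an argument keeps the parsing logic easy to check against a saved response without calling AWS.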


#### Approach
1) Connect to S3 using Python \
https://www.gormanalysis.com/blog/connecting-to-aws-s3-with-python/

2) Create a bucket and push the PDF into it

https://medium.com/@vishal.sharma./create-an-aws-s3-bucket-using-aws-cli-5a19bc1fda79
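
The linked article uses the AWS CLI; the same step can be sketched with Boto3. This is a minimal sketch under stated assumptions: the helper names `create_bucket_if_missing` and `push_pdf` are hypothetical, and the bucket name, region, and file paths in the example call are placeholders.

```python
def create_bucket_if_missing(s3_client, bucket, region):
    """Create `bucket` in `region` unless the caller already owns it.

    Returns True if a bucket was created, False if it already existed.
    """
    existing = {b["Name"] for b in s3_client.list_buckets().get("Buckets", [])}
    if bucket in existing:
        return False
    if region == "us-east-1":
        # us-east-1 is the default location and rejects an explicit constraint.
        s3_client.create_bucket(Bucket=bucket)
    else:
        s3_client.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    return True


def push_pdf(s3_client, local_path, bucket, key):
    """Upload a local PDF and return its s3:// URI."""
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Example call (needs AWS credentials; all names are placeholders):
#   import boto3
#   client = boto3.client("s3", region_name="us-east-2")
#   create_bucket_if_missing(client, "my_bucket", "us-east-2")
#   push_pdf(client, "inputfile.pdf", "my_bucket", "raw/inputfile.pdf")
```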

3) Configure and set up Boto3 in the SageMaker notebook instance


```python
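# Notebook cell magic; run this in a notebook cell, not a plain Python script.
!pip install boto3
import boto3

# Option 1: resource API with explicit credentials.
# 'mykey' and 'mysecretkey' are placeholders; never commit real keys.
s3 = boto3.resource(
    service_name='s3',
    region_name='us-east-2',
    aws_access_key_id='mykey',
    aws_secret_access_key='mysecretkey'
)

# S3 bucket identifier
bucket = s3.Bucket(name="my_bucket")

# Option 2: client API. The service name is required; credentials are
# resolved from the default chain (for example the notebook's IAM role).
s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket='bucketname', Key='filename')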
```


4) Load or download the PDF from S3 into the SageMaker notebook \
https://towardsdatascience.com/how-to-read-data-files-on-s3-from-amazon-sagemaker-f288850bfe8f
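
Both variants of this step can be sketched with a Boto3 S3 client: reading the object straight into memory, or downloading it to the notebook's disk. This is a minimal sketch; the helper names `read_pdf_bytes` and `download_pdf` are hypothetical, and the bucket, key, and local path in the example call are placeholders.

```python
import io


def read_pdf_bytes(s3_client, bucket, key):
    """Load an S3 object into memory and return it as a seekable BytesIO."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return io.BytesIO(response["Body"].read())


def download_pdf(s3_client, bucket, key, local_path):
    """Download an S3 object to a local file and return the local path."""
    s3_client.download_file(bucket, key, local_path)
    return local_path


# Example call (needs AWS credentials; all names are placeholders):
#   import boto3
#   client = boto3.client("s3")
#   pdf_stream = read_pdf_bytes(client, "my_bucket", "raw/inputfile.pdf")
#   download_pdf(client, "my_bucket", "raw/inputfile.pdf", "inputfile.pdf")
```

The in-memory variant suits PDF libraries that accept file-like objects; the download variant suits libraries that expect a file name.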


5) Split the PDF into pages and store them back in S3

```python
from pdfrw import PdfReader, PdfWriter

pages = PdfReader('inputfile.pdf').pages
# Inclusive, 1-based page ranges to extract (pages 3-6 and 7-10).
parts = [(3, 6), (7, 10)]
for first, last in parts:
    outdata = PdfWriter(f'pages_{first}_{last}.pdf')
    # range() excludes its end, so add 1 to keep the last page of each part.
    for pagenum in range(first, last + 1):
        outdata.addpage(pages[pagenum - 1])
    outdata.write()
```
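
The snippet above writes the parts to local files only. Storing them back in S3 could look like the sketch below, which is an assumption-laden variant rather than the method used in these notes: it swaps pdfrw for PyPDF2 (3.x names) so that everything stays in memory, treats each part as an inclusive 1-based page range, and uses hypothetical helper names (`zero_based_indices`, `part_key`, `split_and_upload`) and a hypothetical `split/` key prefix.

```python
import io


def zero_based_indices(first_page, last_page, total_pages):
    """Convert an inclusive 1-based page range into valid 0-based indices."""
    if first_page < 1 or last_page < first_page:
        raise ValueError(f"invalid page range: {first_page}-{last_page}")
    # Clamp the end so a range past the last page does not raise IndexError.
    return list(range(first_page - 1, min(last_page, total_pages)))


def part_key(prefix, first_page, last_page):
    """Build the S3 key for one extracted page range."""
    return f"{prefix}pages_{first_page}_{last_page}.pdf"


def split_and_upload(s3_client, bucket, source_key, parts, prefix="split/"):
    """Split the PDF at source_key into `parts` and upload each part to S3."""
    # Imported here so the helpers above work without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    body = s3_client.get_object(Bucket=bucket, Key=source_key)["Body"].read()
    reader = PdfReader(io.BytesIO(body))
    uploaded = []
    for first_page, last_page in parts:
        writer = PdfWriter()
        for index in zero_based_indices(first_page, last_page, len(reader.pages)):
            writer.add_page(reader.pages[index])
        buffer = io.BytesIO()
        writer.write(buffer)
        key = part_key(prefix, first_page, last_page)
        s3_client.put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())
        uploaded.append(key)
    return uploaded


# Example call (needs AWS credentials; all names are placeholders):
#   import boto3
#   split_and_upload(boto3.client("s3"), "my_bucket", "raw/inputfile.pdf",
#                    [(3, 6), (7, 10)])
```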

#### Checks and testing for the performed tasks

1) Testing reading a PDF from S3


```python
!pip install pdfminer3
import io

import boto3
from pdfminer3.converter import TextConverter
from pdfminer3.layout import LAParams
from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer3.pdfpage import PDFPage

# Download the PDF bytes from S3 (bucket and key are placeholders).
s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket='bucketname', Key='filename')
data = response['Body'].read()

# Set up pdfminer to write the extracted text into an in-memory buffer.
resource_manager = PDFResourceManager()
file_handle = io.StringIO()
converter = TextConverter(resource_manager, file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)

# Process only the first page as a quick check.
for page in PDFPage.get_pages(io.BytesIO(data)):
    print(page)
    page_interpreter.process_page(page)
    break

text = file_handle.getvalue()
converter.close()
file_handle.close()
print(text)

# The result is plain text, not a PDF, so store it under a .txt key.
s3_client.put_object(Bucket='bucketname', Key='test.txt', Body=text.encode('utf-8'))
```

2) Reading a PDF from S3 to perform a specific task (here, reading form text fields)
```python
# https://stackoverflow.com/questions/62799852/read-pdf-object-from-s3
import boto3
from PyPDF2 import PdfReader  # PdfFileReader was removed in PyPDF2 3.0
from io import BytesIO

bucket_name = "pdf-forms-bucket"
item_name = "form.pdf"

# Read the object into memory instead of saving it to disk.
s3 = boto3.resource('s3')
obj = s3.Object(bucket_name, item_name)
fs = obj.get()['Body'].read()
pdf = PdfReader(BytesIO(fs))

# Dictionary mapping form field names to their text values.
data = pdf.get_form_text_fields()
```

3) Testing various PDF splitting methods locally

```python
from PyPDF2 import PdfReader, PdfWriter  # PyPDF2 3.x names

# Placeholder path; point it at a local PDF.
inputpdf = PdfReader("/path to pdf file directory/pdf_name.pdf")

# Write every page to its own single-page PDF.
for i, page in enumerate(inputpdf.pages):
    output = PdfWriter()
    output.add_page(page)
    with open("document-page%s.pdf" % i, "wb") as outputStream:
        output.write(outputStream)
```

#### Issues 
https://stackoverflow.com/questions/65844539/how-to-use-pdfminer-to-extract-text-from-pdf-files-stored-in-s3-bucket-without-delattr \
https://stackoverflow.com/questions/66206423/how-to-upload-a-pdfusing-pdfpages-to-aws-s3-in-python \
https://stackabuse.com/example-upload-a-file-to-aws-s3-with-boto/ \
https://stackoverflow.com/questions/490195/split-a-multi-page-pdf-file-into-multiple-pdf-files-with-python \
https://programtalk.com/python-examples/pdfminer.pdfpage.PDFPage.create_pages/ \
https://www.analyticsvidhya.com/blog/2021/09/pypdf2-library-for-working-with-pdf-files-in-python/ \
https://realpython.com/pdf-python/  \
https://www.blog.pythonlibrary.org/2018/04/11/splitting-and-merging-pdfs-with-python/ \
https://stackoverflow.com/questions/62799852/read-pdf-object-from-s3


## 2. Image classification 

#### Problem statement

Classify casting product image data for quality inspection

####  Tasks 
1. Download the data from Kaggle
2. Store the data in S3
3. Use Image libraries to visualize sample images
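
Tasks 2 and 3 can be sketched as below. This is a minimal sketch, assuming the Kaggle dataset has already been downloaded and unzipped into a local folder. The helper names (`image_keys`, `upload_images`, `show_samples`), the `casting/` key prefix, and the folder and bucket names in the example call are hypothetical; matplotlib and Pillow are assumed to be installed for the display step.

```python
import os


def image_keys(local_dir, prefix="casting/", extensions=(".jpeg", ".jpg", ".png")):
    """Pair every image under local_dir with the S3 key it would be uploaded to."""
    pairs = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            if name.lower().endswith(extensions):
                path = os.path.join(root, name)
                # Keep the folder layout (for example class sub-folders) in the key.
                relative = os.path.relpath(path, local_dir).replace(os.sep, "/")
                pairs.append((path, prefix + relative))
    return sorted(pairs)


def upload_images(s3_client, bucket, local_dir, prefix="casting/"):
    """Upload all images under local_dir and return how many were uploaded."""
    pairs = image_keys(local_dir, prefix)
    for path, key in pairs:
        s3_client.upload_file(path, bucket, key)
    return len(pairs)


def show_samples(paths, columns=4):
    """Display the given image files in a grid (needs matplotlib and Pillow)."""
    import matplotlib.pyplot as plt
    from PIL import Image

    rows = max(1, (len(paths) + columns - 1) // columns)
    _fig, axes = plt.subplots(rows, columns, figsize=(3 * columns, 3 * rows), squeeze=False)
    for ax in axes.flat:
        ax.axis("off")
    for ax, path in zip(axes.flat, paths):
        ax.imshow(Image.open(path), cmap="gray")
        ax.set_title(os.path.basename(path), fontsize=8)
    plt.show()


# Example call (needs AWS credentials; all names are placeholders):
#   import boto3
#   upload_images(boto3.client("s3"), "my_bucket", "casting_data")
#   show_samples([path for path, _key in image_keys("casting_data")[:8]])
```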


#### Reference
https://blog.jovian.ai/metal-casting-product-image-classification-for-quality-inspection-using-pytorch-72c696d205f3

https://aws.amazon.com/blogs/machine-learning/detect-manufacturing-defects-in-real-time-using-amazon-lookout-for-vision/

https://www.kaggle.com/ravirajsinh45/real-life-industrial-dataset-of-casting-product
https://www.kaggle.com/souvikg544/souvik-ghosh-casting


https://sagemakerexamples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/imageclassification_caltech/Image-classification-fulltraining-highlevel.html


## 3. Amazon SageMaker and 🤗 Transformers: Train and Deploy a Summarization Model with a Custom Dataset

#### Reference 
https://towardsdatascience.com/amazon-sagemaker-and-transformers-train-and-deploy-a-summarization-model-with-a-custom-dataset-5efc589fedad
