## Data Science Portfolio - Authors Raw Dataset Cleaning and EDA Notebook ##

### Created by: Albert Schultz  ###

### Date Created: 05/23/2023 ###

### Version: 1.00 ###

### Executive Summary ###
This notebook walks through importing a raw data set of authors, book titles, and years, then cleaning it of stray spaces and delimiter characters so the results can be stored in clean list variables.

## Table of Contents ##

1. [Introduction](#1.-Introduction)
2. [Understanding Purpose, Goals, and Vision](#2.-Understanding-Purpose,-Goals,-and-Vision)
3. [Import Raw Data of Authors' Books Information](#3.-Import-Raw-Data-of-Authors'-Books-Information)
4. [Perform Data Cleaning and Manipulations](#4.-Perform-Data-Cleaning-and-Manipulations)
5. [Summary](#Summary)

## 1. Introduction ##

This project takes the authors' list of poems stored in the variable **highlighted_poems**, cleans it, and extracts the important information into lists so that stakeholders can easily understand the cleaned authors' list. Stakeholders and collaborators can then continue the analysis of the **highlighted_poems** data set.

**Initialize the Notebook for data access, import library modules, and set the working directory for this project.**

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## 2. Understanding Purpose, Goals, and Vision ##

The vision of this Authors' Books of Poems Portfolio notebook is to create a meaningful data presentation that allows stakeholders and collaborators alike to interact with and continue the analysis of the cleaned data set of poem books.

**Vision:** To showcase the cleaned data set of the books of poems, with the author and year associated with each book, so others can easily take this data set further for analysis.

**Goals:**
1. Review the raw data set of **highlighted_poems**, identify its abnormalities, and stage the data for pruning and cleaning.
2. Import the data set into the Python IDE for staging, extractions, data manipulations, and presentation.
3. Create lists of the authors' names, the book titles, and the years in which the books were published.
4. Perform exploratory data analysis (EDA) to understand the books that were published by various authors in various years.
5. Present the cleaned findings of the EDA, showing relationships between the years, the books, and the authors.

## 3. Import Raw Data of Authors' Books Information ##

**Introduction:** In this section, the following data set is imported into this notebook for cleaning, staging, and analysis.

1. Create the required data sets below for analysis. 

In [3]:
highlighted_poems = "Afterimages:Audre Lorde:1997,  The Shadow:William Carlos Williams:1915, Ecstasy:Gabriela Mistral:1925,   Georgia Dusk:Jean Toomer:1923,   Parting Before Daybreak:An Qi:2014, The Untold Want:Walt Whitman:1871, Mr. Grumpledump's Song:Shel Silverstein:2004, Angel Sound Mexico City:Carmen Boullosa:2013, In Love:Kamala Suraiyya:1965, Dream Variations:Langston Hughes:1994, Dreamwood:Adrienne Rich:1987"

2. View the imported poem highlights data set. 

In [7]:
print(highlighted_poems)

Afterimages:Audre Lorde:1997,  The Shadow:William Carlos Williams:1915, Ecstasy:Gabriela Mistral:1925,   Georgia Dusk:Jean Toomer:1923,   Parting Before Daybreak:An Qi:2014, The Untold Want:Walt Whitman:1871, Mr. Grumpledump's Song:Shel Silverstein:2004, Angel Sound Mexico City:Carmen Boullosa:2013, In Love:Kamala Suraiyya:1965, Dream Variations:Langston Hughes:1994, Dreamwood:Adrienne Rich:1987


Looking through the raw data set, there are several abnormalities, such as an inconsistent number of spaces after the commas and fields packed together with colons, that need to be cleaned up before separating the titles, authors, and years into separate lists.

## 4. Perform Data Cleaning and Manipulations ##

**Introduction:** This section walks through splitting the data on commas, stripping the extra whitespace, and then splitting each entry so that the authors, titles, and years can be appended to separate lists for further analysis.

1. Split the **highlighted_poems** string into separate book profiles using the **split()** method with "**,**" as the delimiter.

In [9]:
highlighted_poems_list = highlighted_poems.split(',')

2. After using the **split()** method, view the data set.

In [10]:
print(highlighted_poems_list)

['Afterimages:Audre Lorde:1997', '  The Shadow:William Carlos Williams:1915', ' Ecstasy:Gabriela Mistral:1925', '   Georgia Dusk:Jean Toomer:1923', '   Parting Before Daybreak:An Qi:2014', ' The Untold Want:Walt Whitman:1871', " Mr. Grumpledump's Song:Shel Silverstein:2004", ' Angel Sound Mexico City:Carmen Boullosa:2013', ' In Love:Kamala Suraiyya:1965', ' Dream Variations:Langston Hughes:1994', ' Dreamwood:Adrienne Rich:1987']
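
The leading spaces visible in the output above are what the next steps remove. As a minimal sketch (not part of the original notebook), the entries that still need stripping can be flagged by comparing each one with its stripped version. The names `sample`, `sample_list`, and `needs_stripping` are hypothetical, and `sample` is only the first three entries of the raw string.

```python
# Shortened sample of the raw string from this notebook (first three entries only).
sample = "Afterimages:Audre Lorde:1997,  The Shadow:William Carlos Williams:1915, Ecstasy:Gabriela Mistral:1925"
sample_list = sample.split(',')

# Keep the entries whose text changes when strip() is applied,
# i.e. the entries that carry leading or trailing whitespace.
needs_stripping = [entry for entry in sample_list if entry != entry.strip()]

print(len(needs_stripping), "of", len(sample_list), "entries have stray whitespace")
```

Only the first entry is already clean, because it is the only one not preceded by a comma and spaces.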


3. Create three empty lists, **highlighted_poems_stripped**, **highlighted_poems_details**, and **titles**, so the for loops can append the data into separate lists.

In [11]:
highlighted_poems_stripped = []
highlighted_poems_details = []
titles = []

4. Create a **for loop** that iterates through **highlighted_poems_list**, strips the leading and trailing spaces from each entry, and appends the cleaned entry to the empty list **highlighted_poems_stripped**.

In [12]:
for poem in highlighted_poems_list: #First for loop: strip the surrounding spaces and append the result to the highlighted_poems_stripped list.
    highlighted_poems_stripped.append(poem.strip())

5. Create a **for loop** that iterates through **highlighted_poems_stripped**, splits each entry on "**:**", and appends the resulting list of title, author, and year to the empty list **highlighted_poems_details**.

In [13]:
for highlights in highlighted_poems_stripped: #Second for loop: split each stripped entry on ':' and append the resulting fields to highlighted_poems_details.
    highlighted_poems_details.append(highlights.split(':'))

6. View the semi-clean data set so far. 

In [14]:
print(highlighted_poems_details)

[['Afterimages', 'Audre Lorde', '1997'], ['The Shadow', 'William Carlos Williams', '1915'], ['Ecstasy', 'Gabriela Mistral', '1925'], ['Georgia Dusk', 'Jean Toomer', '1923'], ['Parting Before Daybreak', 'An Qi', '2014'], ['The Untold Want', 'Walt Whitman', '1871'], ["Mr. Grumpledump's Song", 'Shel Silverstein', '2004'], ['Angel Sound Mexico City', 'Carmen Boullosa', '2013'], ['In Love', 'Kamala Suraiyya', '1965'], ['Dream Variations', 'Langston Hughes', '1994'], ['Dreamwood', 'Adrienne Rich', '1987']]
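
Steps 1 through 5 can also be sketched as a single list comprehension that splits on commas, strips each entry, and splits on colons in one pass. This is an alternative shown for comparison, not the notebook's own method. The names `sample` and `sample_details` are hypothetical, and `sample` is only the first three entries of the raw string.

```python
# Shortened sample of the raw string from this notebook (first three entries only).
sample = "Afterimages:Audre Lorde:1997,  The Shadow:William Carlos Williams:1915, Ecstasy:Gabriela Mistral:1925"

# For each comma-separated entry: strip the surrounding whitespace,
# then split on ':' to get a [title, author, year] list.
sample_details = [entry.strip().split(':') for entry in sample.split(',')]

print(sample_details)
```

Like the loops above, this relies on titles and author names never containing a comma or a colon. A title with either character would be split into extra fields.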


7. Determine the number of book profiles in the cleaned data set.

In [15]:
print(len(highlighted_poems_details))

11


8. Iterate through **highlighted_poems_details** and print each book profile (title, author's name, and year) on its own line.

In [16]:
index = 0
while index < len(highlighted_poems_details):
    print(highlighted_poems_details[index])
    index += 1

['Afterimages', 'Audre Lorde', '1997']
['The Shadow', 'William Carlos Williams', '1915']
['Ecstasy', 'Gabriela Mistral', '1925']
['Georgia Dusk', 'Jean Toomer', '1923']
['Parting Before Daybreak', 'An Qi', '2014']
['The Untold Want', 'Walt Whitman', '1871']
["Mr. Grumpledump's Song", 'Shel Silverstein', '2004']
['Angel Sound Mexico City', 'Carmen Boullosa', '2013']
['In Love', 'Kamala Suraiyya', '1965']
['Dream Variations', 'Langston Hughes', '1994']
['Dreamwood', 'Adrienne Rich', '1987']


9. Print individual elements by index to confirm the structure of **highlighted_poems_details**. This sets a base template for the while loops that iterate through the list and append individual attributes to the empty lists **authors**, **titles**, and **years**.

In [17]:
print(highlighted_poems_details[0][0]) #Get test title of book.
print(highlighted_poems_details[0][1]) #Get test author of book.
print(highlighted_poems_details[0][2]) #Get test year of book.

Afterimages
Audre Lorde
1997
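
The index test above shows that each profile holds exactly three fields in the order title, author, year. Under that assumption, the same check can be sketched with tuple unpacking, which names the fields instead of numbering them. The names `sample_details`, `title`, `author`, and `year` are hypothetical, and the sample holds only the first two profiles.

```python
# First two cleaned profiles, copied from the output of highlighted_poems_details.
sample_details = [
    ['Afterimages', 'Audre Lorde', '1997'],
    ['The Shadow', 'William Carlos Williams', '1915'],
]

# Unpack the three fields of the first profile into named variables.
# This raises ValueError if a profile does not have exactly three fields.
title, author, year = sample_details[0]

print(title)
print(author)
print(year)
```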


10. Iterate through the **highlighted_poems_details** variable list and extract only the **title** and append the titles to the empty **titles** list. 

In [18]:
#Create empty lists for the authors, titles, and years.
authors = []
titles = []
years = []

index = 0 #Initialize at 0.
#While loop to append each title to the empty titles list.
while index < len(highlighted_poems_details):
    titles.append(highlighted_poems_details[index][0])
    index += 1
print(titles) #Cleaned titles list.

['Afterimages', 'The Shadow', 'Ecstasy', 'Georgia Dusk', 'Parting Before Daybreak', 'The Untold Want', "Mr. Grumpledump's Song", 'Angel Sound Mexico City', 'In Love', 'Dream Variations', 'Dreamwood']


11. Print the titles list's count. 

In [19]:
print(f"The total titles is {str(len(titles))}.")

The total titles is 11.


12. Iterate through the **highlighted_poems_details** variable list and extract only the **author** and append the authors to the empty **authors** list. 

In [20]:
index = 0 #Initialize at 0.
#While loop to append each author to the empty authors list.
while index < len(highlighted_poems_details):
    authors.append(highlighted_poems_details[index][1])
    index += 1
print(authors) #Cleaned authors list.

['Audre Lorde', 'William Carlos Williams', 'Gabriela Mistral', 'Jean Toomer', 'An Qi', 'Walt Whitman', 'Shel Silverstein', 'Carmen Boullosa', 'Kamala Suraiyya', 'Langston Hughes', 'Adrienne Rich']


13. Print the authors list's count. 

In [21]:
print(f"The total authors is {str(len(authors))}.")

The total authors is 11.


14. Iterate through the **highlighted_poems_details** variable list and extract only the **year** and append the years to the empty **years** list. 

In [22]:
index = 0 #Initialize at 0.
#While loop to append each year to the empty years list.
while index < len(highlighted_poems_details):
    years.append(highlighted_poems_details[index][2])
    index += 1
print(years) #Cleaned years list.

['1997', '1915', '1925', '1923', '2014', '1871', '2004', '2013', '1965', '1994', '1987']
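
The three while loops in steps 10, 12, and 14 each pull one column out of the list of profiles. As a hedged alternative, the built-in `zip()` with argument unpacking can transpose the profiles into all three columns at once. The names `sample_details`, `sample_titles`, `sample_authors`, and `sample_years` are hypothetical, and the sample holds only the first three profiles.

```python
# First three cleaned profiles, copied from the output of highlighted_poems_details.
sample_details = [
    ['Afterimages', 'Audre Lorde', '1997'],
    ['The Shadow', 'William Carlos Williams', '1915'],
    ['Ecstasy', 'Gabriela Mistral', '1925'],
]

# zip(*rows) groups the first fields together, then the second, then the third.
# Each group comes back as a tuple, so it is converted to a list here.
sample_titles, sample_authors, sample_years = (
    list(column) for column in zip(*sample_details)
)

print(sample_titles)
print(sample_authors)
print(sample_years)
```

This assumes every profile has exactly three fields. `zip()` silently truncates to the shortest profile, so a malformed row would not raise an error here.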


15. Print the years list's count. 

In [23]:
print(f"The total years is {str(len(years))}.")

The total years is 11.
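
The years are still stored as strings at this point, as the quotes in the printed list show. For anyone continuing toward the EDA goal, a minimal sketch of converting them to integers and finding the range might look like the following. The names `sample_years` and `sample_years_int` are hypothetical, and the sample holds only the first three years.

```python
# First three years, copied from the output of the years list (still strings).
sample_years = ['1997', '1915', '1925']

# Convert each year string to an integer so it can be compared and sorted numerically.
sample_years_int = [int(year) for year in sample_years]

earliest = min(sample_years_int)
latest = max(sample_years_int)

print(f"Years range from {earliest} to {latest}.")
print(sorted(sample_years_int))
```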


## Summary ##

This project portfolio can help those new to data science understand the step-by-step process of cleaning the authors' book profiles data set and presenting the data in meaningful printouts, so that others can understand the project and take it further should they wish to do so.

In [24]:
index = 0 #Initialize at 0.
while index < len(highlighted_poems_details):
    print(f"The author {authors[index]}, wrote the book {titles[index]}, which was published in {str(years[index])}.")
    index += 1

The author Audre Lorde, wrote the book Afterimages, which was published in 1997.
The author William Carlos Williams, wrote the book The Shadow, which was published in 1915.
The author Gabriela Mistral, wrote the book Ecstasy, which was published in 1925.
The author Jean Toomer, wrote the book Georgia Dusk, which was published in 1923.
The author An Qi, wrote the book Parting Before Daybreak, which was published in 2014.
The author Walt Whitman, wrote the book The Untold Want, which was published in 1871.
The author Shel Silverstein, wrote the book Mr. Grumpledump's Song, which was published in 2004.
The author Carmen Boullosa, wrote the book Angel Sound Mexico City, which was published in 2013.
The author Kamala Suraiyya, wrote the book In Love, which was published in 1965.
The author Langston Hughes, wrote the book Dream Variations, which was published in 1994.
The author Adrienne Rich, wrote the book Dreamwood, which was published in 1987.
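
The summary loop above keeps the three lists aligned through a shared index. The same sentences can be sketched with a for loop over `zip()`, which walks the three lists in parallel without a counter. The names `sample_authors`, `sample_titles`, `sample_years`, and `sentences` are hypothetical, and the samples hold only the first two entries of each list.

```python
# First two entries of each cleaned list, copied from the outputs above.
sample_authors = ['Audre Lorde', 'William Carlos Williams']
sample_titles = ['Afterimages', 'The Shadow']
sample_years = ['1997', '1915']

# zip() yields one (author, title, year) triple per book.
sentences = []
for author, title, year in zip(sample_authors, sample_titles, sample_years):
    sentences.append(
        f"The author {author}, wrote the book {title}, which was published in {year}."
    )

print("\n".join(sentences))
```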
