# Exercise Fourteen: Project Design Starter
In this exercise, you'll be planning out a complex project. You'll draft some code, but the focus is on using comments to describe your project structure. The sample document below will guide you through organizing and annotating your project design. The primary components you'll include are:

- Dependencies: What modules will your project need?
- Collection: Where is your data coming from?
- Processing: How will you format and process your data?
- Analysis: What techniques will you use to understand your data?
- Visualization: How will you visualize and explore your data?

Don't worry if you aren't exactly certain how you would implement everything - this should be a starting point for a larger research study, but it doesn't need to be a complete, functional workflow. Aim for a "good enough" starting point that you can reference and extend for future work.

Note where you have something working, and where it's broken or in progress.

What helped push the technical philosophy further and keep the process moving was knowing that "An interface is a set of cognitive cues. It may look like a screen full of pictures of things inside the computer, but in fact, the interface mediates between an individual [and] the computational activity" (Drucker 176).

"When people change how they speak or act in order to conform to dominant norms, we call it “code-switching”" (Benjamin 180). 

"Data, in short, do not speak for themselves and don’t always change hearts and minds or policy" (Benjamin 192). 

## BONUS Project Overview: IMDb

Using Beautiful Soup, I want to show the process of scraping elements from IMDb pages and bringing them in as data for this inquiry.


## Stage One: Dependencies

Beautiful Soup makes it comfortable to work without an API, because the data can be read directly from the HTML of each page. This brings to life the advice to begin with "a problem or a question. If your problem or question is not well defined, develop or find one which is" (Karsdorp, Kestemont, Riddell 323).

With "director/" as the path, I was off and running, as "An interface can connect a person with a computer.., a computer with a computer (as in an API)" (Drucker 172).


Apart from echoing the abbreviation of my own identity, IMDb is an Amazon-owned database serving the entertainment industry, with a focus on film and cinema. I am bringing together the information available at the basic, consumer-facing level, which serves general interest in film backgrounds, with the IMDb Pro features that support industry searches for knowledge and contacts. With Beautiful Soup in place, the process begins with the package imports.

In [22]:
# Python Package imports
import requests
from bs4 import BeautifulSoup
from dateutil.parser import parse
import concurrent.futures
import pandas as pd

url= "https://www.imdb.com/search/title?count=100&title_type=feature,tv_series&ref_=nv_wl_img_2"

Drawing IMDb data from a large number of URLs is handled with threads from Python's "concurrent.futures" module, which lets several pages be requested at once.

In [23]:
# Maximum number of threads that will be spawned
MAX_THREADS = 50

Below are the lists that will hold the attributes of interest, ranging from "movie_title_arr" to "image_id_arr."

In [24]:
movie_title_arr = []
movie_year_arr = []
movie_genre_arr = []
movie_synopsis_arr =[]
image_url_arr  = []
image_id_arr = []

Below is the scraping phase: each helper function uses Beautiful Soup to pull one of the attributes above out of a page element, returning 'NA' when the element is missing.

In [25]:
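# Each helper returns 'NA' when the expected element is missing.
# Catching specific exceptions (rather than a bare "except") avoids hiding unrelated errors.
LOOKUP_ERRORS = (AttributeError, IndexError, TypeError)

def getMovieTitle(header):
    # The title is the text of the first link inside the header
    try:
        return header[0].find("a").getText()
    except LOOKUP_ERRORS:
        return 'NA'

def getReleaseYear(header):
    # The year sits in a span next to the title, e.g. "(2020)"
    try:
        return header[0].find("span",  {"class": "lister-item-year text-muted unbold"}).getText()
    except LOOKUP_ERRORS:
        return 'NA'

def getGenre(muted_text):
    # Genres are a comma-separated string inside a "genre" span
    try:
        return muted_text.find("span",  {"class":  "genre"}).getText()
    except LOOKUP_ERRORS:
        return 'NA'

def getsynopsys(movie):
    # The synopsis is the second "text-muted" paragraph of the listing
    try:
        return movie.find_all("p", {"class":  "text-muted"})[1].getText()
    except LOOKUP_ERRORS:
        return 'NA'

def getImage(image):
    # The poster URL is stored in the "loadlate" attribute of the img tag
    try:
        return image.get('loadlate')
    except LOOKUP_ERRORS:
        return 'NA'

def getImageId(image):
    # The IMDb title id (e.g. "tt2235759") is stored in "data-tconst"
    try:
        return image.get('data-tconst')
    except LOOKUP_ERRORS:
        return 'NA'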
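
To check helpers like these without contacting IMDb, they can be run against a small hand-written HTML fragment. The sketch below is illustrative only: the fragment is invented for the example and merely imitates the class names the functions above look for, and "get_text_or_na" is a hypothetical helper that is not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the class names used by the helpers above.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <div class="lister-item-image float-left">
    <img class="loadlate" loadlate="https://example.com/poster.jpg" data-tconst="tt0000001">
  </div>
  <h3 class="lister-item-header">
    <a href="/title/tt0000001/">Sample Film</a>
    <span class="lister-item-year text-muted unbold">(2020)</span>
  </h3>
  <p class="text-muted"><span class="genre">Drama</span></p>
  <p class="text-muted">A made-up synopsis.</p>
</div>
"""


def get_text_or_na(parent, name, attrs=None):
    """Return the text of the first matching tag, or 'NA' if there is none."""
    tag = parent.find(name, attrs or {}) if parent is not None else None
    return tag.get_text(strip=True) if tag is not None else "NA"


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
movie = soup.find("div", {"class": "lister-item mode-advanced"})

title = get_text_or_na(movie.find("h3"), "a")
year = get_text_or_na(movie, "span", {"class": "lister-item-year text-muted unbold"})
genre = get_text_or_na(movie, "span", {"class": "genre"})
# No tag with this class exists in the fragment, so the fallback is returned.
missing = get_text_or_na(movie, "span", {"class": "certificate"})

# A string as the second argument to find() is matched against the CSS class.
image = movie.find("img", "loadlate")
image_id = image.get("data-tconst") if image is not None else "NA"

print(title, year, genre, missing, image_id)
```

Testing against a fixed fragment like this makes it easier to note which lookups work and which are broken before any live requests are made.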

The main function below is responsible for iteration. It requests one IMDb results page, loops over each listing on that page, and extracts the attributes with the helper functions, appending each value to its list.

In [26]:
def main(imdb_url):
    # Fetch one results page and parse its HTML
    response = requests.get(imdb_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Each listing on the page is one "lister-item" block
    movies_list = soup.find_all("div", {"class": "lister-item mode-advanced"})

    for movie in movies_list:
        header = movie.find_all("h3", {"class": "lister-item-header"})
        # Guard against listings that have no "text-muted" paragraph
        muted = movie.find_all("p", {"class": "text-muted"})
        muted_text = muted[0] if muted else None
        # Guard against listings that have no image block
        imageDiv = movie.find("div", {"class": "lister-item-image float-left"})
        image = imageDiv.find("img", "loadlate") if imageDiv else None

        # Movie title
        movie_title = getMovieTitle(header)
        movie_title_arr.append(movie_title)

        # Movie release year
        year = getReleaseYear(header)
        movie_year_arr.append(year)

        # Genre of movie
        genre = getGenre(muted_text)
        movie_genre_arr.append(genre)

        # Movie synopsis
        synopsis = getsynopsys(movie)
        movie_synopsis_arr.append(synopsis)

        # Image attributes
        img_url = getImage(image)
        image_url_arr.append(img_url)

        # Use the helper so a missing image yields 'NA' instead of an error
        image_id = getImageId(image)
        image_id_arr.append(image_id)

Next, the URLs for the filtered film search are assembled below, one per page of results.

In [27]:
# A list to store all the URLs that will be queried
imageArr = []

# Maximum number of pages to iterate over
MAX_PAGE = 51

# Loop to generate all the URLs; each page holds 250 results.
for i in range(MAX_PAGE):
    # Starting record for page i: 0, 251, 501, ...
    totalRecords = 0 if i == 0 else (250 * i) + 1
    print(totalRecords)
    imdb_url = f'https://www.imdb.com/search/title/?release_date=2020-01-02,2021-02-01&user_rating=4.0,10.0&languages=en&count=250&start={totalRecords}&ref_=adv_nxt'
    imageArr.append(imdb_url)

0
251
501
... (46 intermediate offsets omitted, each 250 higher than the last)
12251
12501
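
The offsets printed above follow from the paging arithmetic: each page holds 250 results, so page i begins at record 250*i + 1, with the first page written as 0. A minimal sketch of the same construction as a reusable function is shown below. It assumes the query parameters from the cell above, and "build_search_urls" is a hypothetical name introduced only for this example.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title/"


def build_search_urls(max_page, page_size=250):
    """Return one search URL per results page (hypothetical helper)."""
    urls = []
    for page in range(max_page):
        # Same offsets as the loop above: 0, 251, 501, ...
        start = 0 if page == 0 else page_size * page + 1
        params = {
            "release_date": "2020-01-02,2021-02-01",
            "user_rating": "4.0,10.0",
            "languages": "en",
            "count": page_size,
            "start": start,
            "ref_": "adv_nxt",
        }
        # safe="," keeps the commas in the date and rating ranges unescaped
        urls.append(f"{BASE_URL}?{urlencode(params, safe=',')}")
    return urls


urls = build_search_urls(3)
for u in urls:
    print(u)
```

Keeping the parameters in a dictionary makes it easier to change the date range or rating filter later without editing a long URL string by hand.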


The "download_stories" function passes the URLs to the main function, using at most "MAX_THREADS" threads to send requests concurrently.

In [28]:
def download_stories(story_urls):
    threads = min(MAX_THREADS, len(story_urls))
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
        executor.map(main, story_urls)
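
One caveat with appending to six separate lists from several threads at once is that values from different pages can interleave, so a title may end up on a different row from its year. A hedged alternative, sketched below, has each worker return its own list of records and lets "executor.map" hand the results back in input order. Here "scrape_page" is a stand-in that builds placeholder records instead of requesting a page, purely so the sketch runs offline; both it and "download_all" are hypothetical names, not part of the notebook.

```python
import concurrent.futures

MAX_THREADS = 50


def scrape_page(url):
    """Stand-in for a real page scraper: returns records without any network call."""
    return [{"url": url, "title": f"placeholder title {n}"} for n in range(2)]


def download_all(urls, worker=scrape_page):
    """Run worker over urls in a thread pool and return one flat list of records."""
    if not urls:
        # ThreadPoolExecutor rejects max_workers=0, so handle the empty case first
        return []
    threads = min(MAX_THREADS, len(urls))
    records = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
        # map() yields results in the same order as urls; iterating over it
        # also re-raises any exception that occurred inside a worker.
        for page_records in executor.map(worker, urls):
            records.extend(page_records)
    return records


records = download_all(["page-a", "page-b", "page-c"])
print(len(records), records[0])
```

Because every record keeps its fields together in one dictionary, the rows stay aligned, and a list of dictionaries like this can be passed directly to "pd.DataFrame".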

Finally, calling the download function collects the requested data, which is then assembled into a pandas DataFrame.

In [29]:
# Call the download function with the array of URLS called imageArr
download_stories(imageArr)

# Attach all the data to the pandas dataframe. You can optionally write it to a CSV file as well
movieDf = pd.DataFrame({
    "Title": movie_title_arr,
    "Release_Year": movie_year_arr,
    "Genre": movie_genre_arr,
    "Synopsis": movie_synopsis_arr,
    "image_url": image_url_arr,
    "image_id": image_id_arr,
})

As a result, the IMDb data is available in a readable form, ready for analysis.

In [30]:
print('--------- Download Complete CSV Formed --------')

# movieDf.to_csv('file.csv', index=False) : If you want to store the file.
movieDf.head()


--------- Download Complete CSV Formed --------


|   | Title | Release_Year | Genre | Synopsis | image_url | image_id |
|---|---|---|---|---|---|---|
| 0 | The Great | (2020– ) | \nBiography, Comedy, Drama | \nA royal woman living in rural Russia during ... | https://m.media-amazon.com/images/M/MV5BYzVmOG... | tt2235759 |
| 1 | Bruised | (2020) | \nDrama, Sport | \nA disgraced MMA fighter finds redemption in ... | https://m.media-amazon.com/images/M/MV5BMWRjZG... | tt8310474 |
| 2 | Ted Lasso | (2020– ) | \nComedy, Drama, Sport | \nAmerican college football coach Ted Lasso he... | https://m.media-amazon.com/images/M/MV5BMDVmOD... | tt10986410 |
| 3 | Locke & Key | (2020– ) | \nDrama, Fantasy, Horror | \nAfter their father is murdered under mysteri... | https://m.media-amazon.com/images/M/MV5BNmYyNW... | tt3007572 |
| 4 | The Little Things | (2021) | \nCrime, Drama, Mystery | \nKern County Deputy Sheriff Joe Deacon is sen... | https://m.media-amazon.com/images/M/MV5BOGFlNT... | tt10016180 |
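
The preview shows that the scraped text still carries a leading "\n" and that the year is wrapped in parentheses, so a processing step would likely start with cleanup. A minimal sketch using only the standard library follows. The names "clean_text", "extract_year", and "split_genres" are hypothetical helpers, and the sample values are copied from the rows above.

```python
import re


def clean_text(value):
    """Strip surrounding whitespace, including the leading newline left by scraping."""
    return value.strip()


def extract_year(value):
    """Return the first four-digit year in a string such as '(2020– )', or None."""
    match = re.search(r"\d{4}", value)
    return int(match.group()) if match else None


def split_genres(value):
    """Split a comma-separated genre string into a list of clean genre names."""
    return [g.strip() for g in value.split(",") if g.strip()]


print(clean_text("\nDrama, Sport"))
print(extract_year("(2020– )"))
print(split_genres("\nBiography, Comedy, Drama"))
```

With the DataFrame above, functions like these could be applied column by column, for example with movieDf["Genre"].map(split_genres).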


In [45]:
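# In progress: this cell currently fails with the NameError shown below.
# "Synopsis" needs to be defined first as a table indexed by topic label
# (for example 'Topic 2') with one column per word; nothing in this
# notebook creates it yet.
from matplotlib import pyplot as plt
from wordcloud import WordCloud

# Top 18 words for Topic 2, sorted by weight
words = Synopsis.loc['Topic 2'].sort_values(ascending=False).head(18)

print(Synopsis.loc['Topic 2'].head(20))

# Create and generate a word cloud image:
wordcloud = WordCloud().generate_from_frequencies(words.to_dict())

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()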

NameError: name 'Synopsis' is not defined
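
Until a topic model supplies the missing "Synopsis" table, one simpler stand-in for the word cloud input is a plain word count over the synopsis text. This swaps topic weights for raw frequencies, so it is not the same analysis, only a way to get the visualization step working. The sketch assumes a list of synopsis strings (the two below are invented for the example) and a small hypothetical stop-word set.

```python
import re
from collections import Counter

# Hypothetical stop-word set; a fuller list would be needed in practice.
STOP_WORDS = {"a", "an", "the", "in", "of", "and", "to", "is", "her", "his"}


def word_frequencies(texts, top_n=18):
    """Count lowercase words across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    # Keep only the top_n most frequent words as a plain dictionary
    return dict(counts.most_common(top_n))


# Invented sample text standing in for the scraped synopses.
sample_synopses = [
    "\nA detective hunts a killer in the city.",
    "\nA killer hides in the city and the detective waits.",
]
freqs = word_frequencies(sample_synopses)
print(freqs)
```

The resulting dictionary maps each word to a count, which is the shape that "WordCloud().generate_from_frequencies" accepts. With the DataFrame above, the input could be movieDf["Synopsis"].tolist().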