# Advanced Module 3: Unstructured Data

This notebook contains a short overview of the *Unstructured Data* module for QSO370/QSO570. It is only an overview: to master this content, you'll need to read and understand Chapters 19 and 20 of our textbook. You should expect to follow along with the examples in the text and to pursue additional tutorials that help you implement the techniques you've learned. This module is a bit more *coding-heavy* than the other modules. That said, you'll learn some really useful, cutting-edge tools here that will expand your ability to access and analyze an essentially unlimited pool of data.

## Module Overview

Most of the world's data is unstructured or semi-structured. Videos, audio files, images, and text are all data without a clear structure in their raw forms. To extract information from such sources, we first need methods for converting the data into a consistent format, and for large-scale projects that format must be common and easily readable by a machine. Chapters 19 and 20 of our text will walk you through two rapidly evolving applications of machine learning with unstructured data: social media analytics and text mining.
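As a small illustration of what converting raw text into a machine-readable format can look like, the sketch below builds a bag-of-words table: one row per document and one column per word, with each cell holding a word count. The example sentences and the helper names `tokenize` and `bag_of_words` are made up for illustration, and this is only one of many possible representations, not necessarily the one our textbook uses.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(documents):
    """Return (vocabulary, rows): rows[i][j] counts vocabulary[j] in documents[i]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows

# Hypothetical example documents (not taken from the textbook).
docs = [
    "Great service, great prices!",
    "The service was slow.",
]
vocabulary, rows = bag_of_words(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each row is now a fixed-length list of numbers, which is the kind of input most machine learning methods expect, even though the original sentences had different lengths and wording.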

## Assignments

Note that all assignments are **individual** and that collaboration is not permitted. To be clear, <u>no document or code sharing is allowed</u>. If you have questions, you can discuss them with me and with classmates during the weekly discussion meetings. I've also added a Slack channel for this module, and it is okay to have public conversations there. You should cite any sources you use outside of our textbook in completing assignments. All of the work you submit for this module must be your own: copying and pasting from the textbook is not acceptable and will receive no credit. Reproducing a walkthrough or completing a web-based tutorial for the capstone section of this module will be considered academic dishonesty. Any such cases will be awarded no credit and will be reported to the Dean's Office.

The assignments associated with this module appear below.

1. Create a Jupyter Notebook which will serve as your set of notes for this module. You can either build a single Jupyter Notebook which covers all chapters in the module or build a Jupyter Notebook for each chapter individually. The choice is yours.
    + Your notebook should give an overview of the techniques it addresses.
    + Your notebook should discuss scenarios when these techniques can/should be applied. What are they used for?
    + You should include a section on relevant terminology. All definitions presented should include your own explanations of the corresponding term. You can either provide your own detailed definition or provide a definition from a source (with citation) and accompany that definition with your own explanations and/or examples.
    + You should discuss any specific data requirements for the techniques being discussed. What types of data do your techniques deal with? Are there any special pre-processing requirements?
    + Your notebook should cover what your technique *actually does* -- how does it work? 
    + Your notebook should discuss the results from your technique(s) and how they can be interpreted/operationalized.
2. You must complete the following *homework problems*.
    + **Problem 1:** Request and obtain an API key from *Twitter* (as well as at least one additional social media site you are interested in pulling data from). Once you've done this, write a short Python script in a Jupyter Notebook that will pull the most recent 300 tweets using the hashtag `#analytics` or another hashtag of your choosing.
    + **Problem 2:** In addition to accessing data via an API, another useful technique is to scrape data directly from the HTML code that produces a webpage. Complete the [tutorial here](https://towardsdatascience.com/web-scraping-job-postings-from-indeed-com-using-selenium-5ae58d155daf) about web scraping for a job search. Substitute the location and search criteria to include job keywords and a location that you are interested in.
3. Capstone Project (see below).
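
To make the idea behind Problem 2 concrete, here is a minimal sketch of pulling data out of HTML. It uses Python's built-in `html.parser` module rather than the Selenium approach in the linked tutorial, and it parses a small hard-coded page instead of a live site. The page markup, the `title` class name, and the `TitleParser` class are all hypothetical; a real job site will use different markup that you will need to inspect yourself.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "h2" and dict(attrs).get("class") == "title":
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in pieces, so append to the current title.
        if self.in_title:
            self.titles[-1] += data

# Hypothetical page source; a real job site will use different markup.
page = """
<html><body>
  <div class="job"><h2 class="title">Data Analyst</h2><span>Boston, MA</span></div>
  <div class="job"><h2 class="title">Marketing &amp; Analytics Intern</h2><span>Remote</span></div>
</body></html>
"""

parser = TitleParser()
parser.feed(page)
titles = [t.strip() for t in parser.titles]
print(titles)
```

The same pattern (find the tags that wrap the information you want, then collect their text) carries over to other scraping tools, which mostly differ in how they fetch the page and how you describe the tags to look for.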

## Capstone Project

This module has the most open-ended capstone option. Your project is likely to be truly novel, but you must clear a proposal with me first. A project proposal should be at most a single page in length, and could be as short as a paragraph. It should include a short description of the project (a few sentences to a paragraph) and an outline of the steps you expect to take to complete the project satisfactorily.

For example, a hypothetical student project could seek to construct a social media monitoring utility for *Home Depot*. This project could be justified because *Home Depot* is interested in real-time analytics on what customers are saying about the store and possibly about DIY projects in general. Additionally, *Home Depot* may be interested in monitoring customer sentiment about its direct competitors (*Lowe's*, *Ace Hardware*, etc.). To complete this project, the student would need to obtain an API key for Twitter and possibly other social media sites (list the sites you will be using if that is your plan). Deliverables for this project might include (i) background documentation on the development of the tool, covering where it seeks information, what it does with that information, and how that information benefits *Home Depot*, and (ii) several example daily or weekly briefings produced by the tool. In the case of this project, these briefings might be designed as one- or two-page snapshots that give C-suite executives (think a CMO or CEO) a quick and clear run-down of the most important happenings.

All capstone submissions should take the form of a formal report (the example above suffices since it is composed of several smaller-scale reports). It should be clear that this capstone is a significant undertaking. The project spans the final six weeks of our semester, and the work is designed to fill that entire period. Please do not plan to start this project in the final one, two, or even three weeks of the semester. You will not be successful if you procrastinate on this assignment.