# Personal Information
Name: **Anh Tran**

StudentID: **12770698**

Email: [**anh.tran1@student.uva.nl**](mailto:anh.tran1@student.uva.nl)

Submitted on: **DD.MM.YYYY**

# Data Context
**In this section you should introduce the datasources and datasets which you will be working with. Explain where they are from as well as their domain. Give an overview of what the context of the data is. You should not spend more than 1 to 2 paragraphs here as the core information will be in the next section.**

This research project explores gentrification: the process by which a neighborhood changes as wealthier residents move in, bringing investment and physical improvements but displacing existing residents as prices rise and local cultures are homogenized or replaced. The project examines visual indicators of gentrification, specifically the signage of storefronts in Amsterdam, by applying computer vision methods to images of facades in the city.

The image dataset used in this project comes from the [StreetSwipe project](http://streetswipe.aestheticsofexclusion.com/about.php). Through crowd-sourcing, the project lets people decide whether a facade is gentrified by voting "Yes" or "No" on street-view images. The official *gentrified* and *non-gentrified* labels are generated based on the majority of votes for each facade. Additionally, if a subsequent voter decides against the majority, they are prompted to provide a textual explanation for their vote.
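The majority-vote labelling described above could be sketched as follows. This is a minimal, dependency-free illustration and not StreetSwipe's actual implementation: the `(facade_id, vote)` input format, the function name `majority_labels`, and the handling of ties (left unlabelled as `None`) are all assumptions.

```python
from collections import Counter, defaultdict


def majority_labels(votes):
    """Map each facade id to a label based on the majority of its votes.

    `votes` is an iterable of (facade_id, vote) pairs, where vote is
    "Yes" (gentrified) or "No" (non-gentrified). Hypothetical format.
    """
    counts = defaultdict(Counter)
    for facade_id, vote in votes:
        counts[facade_id][vote] += 1

    labels = {}
    for facade_id, c in counts.items():
        if c["Yes"] > c["No"]:
            labels[facade_id] = "gentrified"
        elif c["No"] > c["Yes"]:
            labels[facade_id] = "non-gentrified"
        else:
            labels[facade_id] = None  # tie: assumption, left unlabelled
    return labels


# Toy votes, made up for illustration
example_votes = [
    ("a", "Yes"), ("a", "Yes"), ("a", "No"),
    ("b", "No"),
    ("c", "Yes"), ("c", "No"),
]
example_labels = majority_labels(example_votes)
```

How ties are actually resolved in the dataset should be checked against the label file before relying on this rule.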

Scene-text detection will be applied to this data to identify the regions of each image that contain storefront signage. From each extracted text region (still an image), font recognition and color extraction will characterize its visual attributes, and text recognition will extract machine-readable text strings, whose semantic meaning will be studied using word embeddings. This pipeline will be applied to the gentrified and non-gentrified subsets of the data, and the learnt attributes (fonts, colors, semantics) of the two classes will then be compared to understand what is seen as gentrified.
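As an illustration of the color-extraction step only, the sketch below finds the most frequent coarse colors in a text region. It assumes the region is already available as a list of `(r, g, b)` tuples, and it uses simple histogram quantization as a stand-in for whichever method the project ends up using (clustering would be an alternative). The function name `dominant_colors` and the `bin_size` default are hypothetical.

```python
from collections import Counter


def dominant_colors(pixels, n_colors=3, bin_size=32):
    """Return the n most frequent coarse RGB colors among `pixels`.

    `pixels` is an iterable of (r, g, b) tuples with values in 0..255.
    Each channel is snapped to the centre of a bin of width `bin_size`,
    so near-identical shades are counted together.
    """
    def quantize(value):
        return (value // bin_size) * bin_size + bin_size // 2

    counts = Counter(tuple(quantize(ch) for ch in px) for px in pixels)
    return [color for color, _ in counts.most_common(n_colors)]


# Toy "text region": mostly dark red with a few near-white pixels
region = [(200, 10, 10)] * 6 + [(205, 12, 8)] * 2 + [(250, 250, 250)] * 3
top_colors = dominant_colors(region, n_colors=2)
```

A coarse bin width keeps the counts robust to lighting noise, at the cost of merging shades that a finer analysis might want to keep apart.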

# Data Description

**Present here the results of your exploratory data analysis. Note that there is no need to have a "story line" - it is more important that you show your understanding of the data and the methods that you will be using in your experiments (i.e. your methodology).**

**As an example, you could show data, label, or group balances, skewness, and basic characterizations of the data. Information about data frequency and distributions as well as results from reduction mechanisms such as PCA could be useful. Furthermore, indicate outliers and how/why you are taking them out of your samples, if you do so.**

**The idea is, that you conduct this analysis to a) understand the data better but b) also to verify the shapes of the distributions and whether they meet the assumptions of the methods that you will attempt to use. Finally, make good use of images, diagrams, and tables to showcase what information you have extracted from your data.**

As you can see, you are in a jupyter notebook environment here. This means that you should focus little on writing text and more on actually exploring your data. If you need to, you can use the amsmath environment in-line: $e=mc^2$ or also in separate equations such as here:

\begin{equation}
    e=mc^2 \mathrm{\space where \space} e,m,c\in \mathbb{R}
\end{equation}

Furthermore, you can insert images such as your data aggregation diagrams like this:

<!-- ![image](example.png) -->

In [6]:
# Imports
import os
import numpy as np
import pandas as pd
from glob import glob
# from bq_helper import BigQueryHelper
from dask import bag, diagnostics 
from urllib import request
import cv2
import missingno as msno
import hvplot.pandas  # custom install
from matplotlib import pyplot as plt
%matplotlib inline

### Data Loading

In [2]:
# Load your data here

### Label csv
Pre-processing (rename columns etc)
* Note the difference in the number of responses between the pre- and post- versions
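A minimal sketch of this pre-processing, using only the standard library. The column names in `RENAME` and the two in-memory files standing in for the pre- and post- versions are hypothetical; the real label csv may be laid out differently.

```python
import csv
import io

# Hypothetical raw header -> tidy name mapping; the real columns may differ
RENAME = {"img": "image_id", "yes": "votes_yes", "no": "votes_no"}


def load_labels(csv_file, rename=RENAME):
    """Read a label CSV into a list of dicts with renamed columns."""
    reader = csv.DictReader(csv_file)
    return [
        {rename.get(key, key): value for key, value in row.items()}
        for row in reader
    ]


def total_responses(rows):
    """Sum the yes and no votes over all rows."""
    return sum(int(r["votes_yes"]) + int(r["votes_no"]) for r in rows)


# Two toy versions of the label file standing in for the real ones
version_pre = io.StringIO("img,yes,no\nf1.jpg,3,1\nf2.jpg,0,4\n")
version_post = io.StringIO("img,yes,no\nf1.jpg,5,2\nf2.jpg,1,6\nf3.jpg,2,2\n")

labels_pre = load_labels(version_pre)
labels_post = load_labels(version_post)
response_diff = total_responses(labels_post) - total_responses(labels_pre)
```

In the notebook itself, the same renaming can be done with pandas via `DataFrame.rename(columns=...)`.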

### Images: 
Make sure to add some explanation of what you are doing in your code. This will help you and whoever will read this a lot in following your steps.

#### Sample size per class
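A small sketch of how the class balance could be tabulated. The label strings and the toy list below are placeholders, not the real distribution.

```python
from collections import Counter


def class_balance(labels):
    """Return {label: (count, share)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


# Toy labels, not the real distribution
toy_labels = ["gentrified"] * 3 + ["non-gentrified"] * 1
balance = class_balance(toy_labels)
```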

#### Specs
* Size
* Dimensions
* Aspect ratios (width/height)
* Avg width and height
* Resolution
* Colors

These specs should be assessed in relation to the models' input image size requirements.
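The size-related specs above could be summarized roughly as follows. The sketch works on plain `(width, height)` pairs so that it stays self-contained; in the notebook these could be collected from `cv2.imread(path).shape[:2]`, which returns `(height, width)`. The `model_input` default of 224 by 224 is only a placeholder, not a requirement of any specific model used here.

```python
from statistics import mean


def summarize_dimensions(sizes, model_input=(224, 224)):
    """Summarize image (width, height) pairs against a model input size.

    `model_input` is a placeholder (width, height); replace it with the
    size the chosen model actually expects.
    """
    widths = [w for w, _ in sizes]
    heights = [h for _, h in sizes]
    ratios = [w / h for w, h in sizes]
    # Images smaller than the model input in either dimension need upscaling
    too_small = sum(
        1 for w, h in sizes if w < model_input[0] or h < model_input[1]
    )
    return {
        "n_images": len(sizes),
        "mean_width": mean(widths),
        "mean_height": mean(heights),
        "mean_aspect_ratio": mean(ratios),
        "n_below_model_input": too_small,
    }


# Toy sizes, made up for illustration
toy_sizes = [(640, 480), (200, 100), (400, 400)]
summary = summarize_dimensions(toy_sizes)
```

Counting the images below the model input size gives a quick indication of how much upscaling or padding the pipeline would need.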

#### Visual analysis
Note that many non-gentrified facades contain no signage at all, an observation that can be made directly from the images.

In [3]:
# Also don't forget to comment your code
# This way it's also easier to spot thought errors along the way

In [4]:
# ...