This project uses spaCy, Geonamescache, and Instagrapi to process text data, with a focus on hashtags and graffiti-related content. The goal is to split hashtags into meaningful words, identify entities, and classify terms related to graffiti, cities, and railroad lingo.
- Geonamescache is used for city and country information.
- Instagrapi handles interactions with Instagram data.
- spaCy is used for natural language processing. Custom extensions are added to spaCy's Token class to include properties like is_city, is_graffiti_lingo, and is_railroad_lingo.
The initialize_words function loads wordlists from text files for:
- General vocabulary
- City names
- Graffiti lingo
- Railroad lingo
These wordlists are essential for parsing and classifying hashtags.
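The loading step can be sketched as follows. This is a minimal illustration, not the project's actual initialize_words: the helper load_wordlist, the file names, and the one-term-per-line layout are assumptions. The sketch stores terms in sets for fast membership checks, whereas the sample output further down suggests the project keeps them as lists.

```python
import tempfile
from pathlib import Path


def load_wordlist(path):
    """Read one term per line, lowercased, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip().lower() for line in handle if line.strip()]


def initialize_words(directory, names):
    """Return {name: set of terms}. File names are hypothetical: <name>.txt."""
    return {
        name: set(load_wordlist(Path(directory) / f"{name}.txt"))
        for name in names
    }


# Demo with throwaway files so the sketch runs without the real wordlists.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "cities.txt").write_text("Chicago\nGuadalajara\n\n", encoding="utf-8")
    Path(tmp, "graffiti-lingo.txt").write_text("burner\nblackbook\n", encoding="utf-8")
    wordlists = initialize_words(tmp, ["cities", "graffiti-lingo"])

print(sorted(wordlists["cities"]))
```

Lowercasing at load time means later lookups only need to lowercase the candidate word once.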
The pipeline processes hashtags using the following components:
- Merges tokens starting with # or @ into single tokens.
- Adds extensions to tokens, identifying whether they are hashtags or mentions.
- Splits hashtags into meaningful words using a recursive approach.
- Identifies graffiti-related entities such as writers and crews by looking for specific patterns or keywords in hashtags.
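The first two components can be illustrated without spaCy. The sketch below works on a plain list of strings and assumes the tokenizer has already separated # and @ from the word that follows; the function names and the is_hashtag / is_mention keys are hypothetical stand-ins for the project's pipeline components and token extensions.

```python
def merge_prefixed(tokens):
    """Join a lone '#' or '@' with the token that follows it."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] in ("#", "@") and i + 1 < len(tokens):
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def annotate(token):
    """Flag hashtags and mentions, mirroring the idea of token extensions."""
    return {
        "text": token,
        "is_hashtag": token.startswith("#") and len(token) > 1,
        "is_mention": token.startswith("@") and len(token) > 1,
    }


tokens = ["new", "piece", "by", "@", "mecro", "#", "freightgraffiti"]
annotated = [annotate(t) for t in merge_prefixed(tokens)]
print([a["text"] for a in annotated if a["is_hashtag"]])
```

In a real spaCy pipeline the merge would operate on Doc tokens rather than strings, but the control flow is the same.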
For example, the hashtag #freightgraffitiChicago is processed as follows:
- The hashtag is stripped of the # symbol.
- Words are identified sequentially from the beginning: freight, graffiti, Chicago.
- Each word is checked against predefined wordlists:
  - freight matches general vocabulary.
  - graffiti matches graffiti lingo.
  - Chicago is identified as a city using Geonamescache.
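The wordlist check for this example might look like the sketch below. The tiny hard-coded sets stand in for the real wordlists and for the Geonamescache city data, and the lookup order (city, then graffiti lingo, then railroad lingo, then general vocabulary) is an assumption made for illustration.

```python
# Stand-in data; the project loads these from text files and Geonamescache.
CITIES = {"chicago"}
GRAFFITI_LINGO = {"graffiti", "burner"}
RAILROAD_LINGO = {"boxcar"}
VOCABULARY = {"freight", "street"}


def classify(word):
    """Return the first category whose wordlist contains the word."""
    key = word.lower()
    if key in CITIES:
        return "city"
    if key in GRAFFITI_LINGO:
        return "graffiti_lingo"
    if key in RAILROAD_LINGO:
        return "railroad_lingo"
    if key in VOCABULARY:
        return "vocabulary"
    return "oov"


labels = {word: classify(word) for word in ["freight", "graffiti", "Chicago"]}
print(labels)
```

A word that appears in more than one list gets only the first matching label here; a pipeline that sets one boolean per category would not need to pick an order.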
- Example: #mecro
  - The term mecro is identified as an out-of-vocabulary word (OOV).
  - If it is not part of the wordlist, it is classified as a graffiti writer.
- Example: #mskcrew
  - The term msk is identified, and the suffix crew signifies it as a graffiti crew.
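The writer and crew rules from these two examples can be sketched as a small function. The name classify_entity, the returned dictionary shape, and the known_words set are hypothetical; the only logic taken from the description is that a crew suffix marks a crew and an out-of-vocabulary term is treated as a writer.

```python
def classify_entity(tag, known_words):
    """Apply the crew-suffix rule first, then the OOV-means-writer rule."""
    term = tag.lstrip("#").lower()
    suffix = "crew"
    if term.endswith(suffix) and len(term) > len(suffix):
        return {"name": term[: -len(suffix)], "category": "crew"}
    if term not in known_words:
        return {"name": term, "category": "writer"}
    # Known word: not treated as a graffiti entity.
    return {"name": term, "category": None}


known_words = {"blackbook", "freight", "graffiti"}
for tag in ["#mecro", "#mskcrew", "#blackbook"]:
    print(tag, classify_entity(tag, known_words))
```

Checking the suffix before the vocabulary lookup keeps #mskcrew from being labeled as a writer named mskcrew.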
- #blackbook: Refers to sketchbooks used by graffiti artists to draft designs.
- #burnersonthestreet: Highlights impressive or notable street graffiti.
- #vandalsMexico: Represents graffiti or street art associated with Mexico.
Each of these hashtags undergoes the same processing pipeline to extract meaningful words and classify entities.
The parse_tag function recursively identifies words within a hashtag:
- The input string is split on dashes; if it contains none, it is processed as a whole.
- Words are extracted iteratively from the start of the string.
- Each word is matched against the wordlist or categorized as an OOV entity.
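One way to implement the steps above is a longest-prefix-first recursive search with backtracking. This is a sketch under assumptions, not the project's parse_tag: the helper segment, the (word, label) output format, and the rule that an unsegmentable chunk is returned whole as OOV are all choices made for illustration.

```python
def segment(text, words):
    """Return known words that exactly cover text, or None if no split exists."""
    if not text:
        return []
    for end in range(len(text), 0, -1):  # try the longest prefix first
        prefix = text[:end]
        if prefix in words:
            rest = segment(text[end:], words)
            if rest is not None:
                return [prefix] + rest
    return None


def parse_tag(tag, words):
    """Split on dashes, then segment each chunk; fall back to OOV."""
    parts = []
    for chunk in tag.lstrip("#").lower().split("-"):
        if not chunk:
            continue
        found = segment(chunk, words)
        if found is None:
            parts.append((chunk, "oov"))
        else:
            parts.extend((word, "known") for word in found)
    return parts


WORDS = {"freight", "graffiti", "chicago", "burner", "burners", "on", "the", "street"}
print(parse_tag("#freightgraffitiChicago", WORDS))
print(parse_tag("#burnersonthestreet", WORDS))
print(parse_tag("#mecro", WORDS))
```

The recursion can revisit the same suffix many times on long inputs; hashtags are short, but memoizing segment would be a reasonable next step if that became a problem.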
Custom extensions allow the program to annotate tokens with additional metadata, such as:
- is_city: Whether the token represents a city.
- is_graffiti_lingo: Whether the token is graffiti-related.
- custom_entity_cat: Custom classification for unidentified entities.
The application can be containerized with Docker so that multiple instances can run side by side. Below is the Dockerfile used for building the image:
FROM python:3.11.0b4-buster
ARG _USER="spacy"
ARG _UID="1001"
ARG _GID="100"
ARG _SHELL="/bin/bash"
# Install apt dependencies
RUN apt-get update && apt-get install -y \
nano \
git \
wget
RUN useradd -m -s "${_SHELL}" -N -u "${_UID}" "${_USER}"
ENV USER ${_USER}
ENV UID ${_UID}
ENV GID ${_GID}
ENV HOME /home/${_USER}
ENV PATH "${HOME}/.local/bin/:${PATH}"
ENV PIP_NO_CACHE_DIR "true"
RUN mkdir /home/${_USER}/app && chown ${UID}:${GID} /home/${_USER}/app
USER ${_USER}
COPY --chown=${UID}:${GID} config* /home/${_USER}/app
COPY --chown=${UID}:${GID} requirements* /home/${_USER}/app
COPY --chown=${UID}:${GID} ./* /home/${_USER}/app/
WORKDIR /home/${_USER}/app
RUN pip install -r requirements.txt
CMD bash
First, clone the repository:
git clone https://github.com/abundis-rmn2/Spacy-Hashtag-Geolocator.git
Run the following command to build the Docker image with the name hashtags (you can use any name you prefer):
docker build -t hashtags .
To create and run a container, use the following command. In this case, we name the container hash1:
docker run -it --name hash1 -d hashtags
If you want to keep the process running in a separate terminal session, you can use screen:
screen -S hash
Run a specific script or test inside the container by specifying the container name and the MUID:
# docker exec -t [container_name] python test.py -MUID=[MUID]
docker exec -t hash1 python test.py -MUID=nearaxs_1_hashtagTop_9_48c69711
After executing the process, you will see output indicating the script has started:
Initialize Words wl file
<class 'list'>
185606
Initialize Words cities file
<class 'list'>
26797
Initialize Words graffiti-lingo file
<class 'list'>
551
Initialize Words railroad-lingo file
<class 'list'>
681
Looking for caption in MUID: fr8porn_1_hashtagTop_9_3bf76f18
MUID found : 513
- Scalability: Run multiple instances of the application by creating additional containers.
- Consistency: The Docker image ensures the environment remains consistent across deployments.
- Ease of Use: The containerized setup simplifies the process of setting up and running the application.
This project provides a structured way to analyze hashtags, focusing on graffiti-related content. It identifies cities, graffiti terms, and unique entities like writers and crews, enhancing the understanding of social media data in this niche.