
dsaadeh21/Geolocation-Prediction-from-Tweets-Yachay.ai

Welcome to the repository for the Geolocation Prediction from Tweets project, developed during an externship at Yachay.ai. This project aims to predict the geolocation of tweets by leveraging deep learning techniques. Below is an overview of the project:

Project Overview

  • Preprocessed and performed Exploratory Data Analysis (EDA) on a dataset containing over 600,000 tweets.

  • Developed and trained a regression model with the Keras functional API that uses BERT, a deep learning language model, for geolocation prediction.

Dataset and User Data

The dataset consists of tweet information, with each row containing the following user data:

  • 'text': The content of the tweet.

  • 'id': The unique ID of the tweet.

  • 'user_id': The ID of the user who posted the tweet.

  • 'cluster_id': The assigned ID related to the area from which the tweet was posted.

  • 'timestamp': The date and time when the tweet was posted.

  • 'lat': The latitude of the tweet.

  • 'lng': The longitude of the tweet.

Added Features

Additional features were incorporated into the model, including:

  • 'region': The region location of the tweet.
  • 'language': The language of the tweet.
  • 'tweet_day': The day the tweet was posted.
  • 'tweet_month': The month the tweet was posted.
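As a rough illustration of how the two date features might be derived from 'timestamp', here is a minimal sketch. It assumes the timestamp is an ISO 8601 string and that 'tweet_day' means the day of the month; neither is confirmed by the dataset description, and the helper name `add_date_features` is hypothetical.

```python
from datetime import datetime


def add_date_features(row: dict) -> dict:
    """Return a copy of the row with 'tweet_day' and 'tweet_month' added.

    Assumption: row["timestamp"] is an ISO 8601 string such as
    "2021-03-14 09:26:53". The real dataset may use another format.
    """
    posted_at = datetime.fromisoformat(row["timestamp"])
    enriched = dict(row)  # copy so the original row is left untouched
    enriched["tweet_day"] = posted_at.day      # assumption: day of the month
    enriched["tweet_month"] = posted_at.month  # 1 = January ... 12 = December
    return enriched


# Illustrative row, not taken from the dataset.
example = {"text": "hello world", "timestamp": "2021-03-14 09:26:53"}
features = add_date_features(example)
```

The 'region' and 'language' features are not covered here, since the description does not say how they were obtained.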

Model Inputs and Loss Metrics

The model takes two inputs: the tweet 'text' and a set of three Natural Language Processing (NLP) features. These NLP features are:

  • 'tweet_month': The month in which the tweet was posted.
  • 'tweet_day': The day on which the tweet was posted.
  • 'language': The language in which the tweet was written.
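A two-input model of this kind could be wired up with the Keras functional API roughly as follows. This is only a sketch under stated assumptions, not the repository's actual architecture: the text branch takes a precomputed 768-dimensional embedding (the hidden size of BERT-base) instead of running BERT itself, the three features are assumed to be numerically encoded, and all layer sizes and names are illustrative.

```python
from tensorflow import keras

EMBEDDING_DIM = 768  # assumption: precomputed BERT-base sentence embedding
NUM_FEATURES = 3     # tweet_month, tweet_day, language (numerically encoded)

# Two separate inputs, one per branch.
text_input = keras.Input(shape=(EMBEDDING_DIM,), name="text_embedding")
feature_input = keras.Input(shape=(NUM_FEATURES,), name="nlp_features")

# Small dense branch for each input (sizes are illustrative).
text_branch = keras.layers.Dense(128, activation="relu")(text_input)
feature_branch = keras.layers.Dense(16, activation="relu")(feature_input)

# Merge the branches and regress two values: latitude and longitude.
merged = keras.layers.Concatenate()([text_branch, feature_branch])
hidden = keras.layers.Dense(64, activation="relu")(merged)
coords = keras.layers.Dense(2, name="lat_lng")(hidden)

model = keras.Model(inputs=[text_input, feature_input], outputs=coords)
model.compile(optimizer="adam", loss="mse")
```

The functional API is what makes the two-input layout possible: each input gets its own branch, and the branches are concatenated before the final regression layer.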

The model's loss metrics include:

  • 'haversine_distance': The great-circle distance, in kilometers, between the predicted and actual coordinates.
  • 'mse': Mean Squared Error.
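For reference, the haversine distance between two points can be computed as in the sketch below. It is a plain-Python illustration of the standard formula, not the repository's loss implementation (which would need to operate on tensors); the function name `haversine_km` and the mean Earth radius of 6371 km are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius


def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in kilometers between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)

    # Haversine formula: a is the squared half-chord length between the points.
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    a = min(1.0, a)  # guard against rounding slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative coordinates: Paris and London.
paris_to_london = haversine_km(48.8566, 2.3522, 51.5074, -0.1278)
```

Unlike mse on raw latitude and longitude values, this distance accounts for the curvature of the Earth, so an error of one degree of longitude counts for less near the poles than at the equator.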

Results

The model was trained for 7 epochs, and the performance on the test set was as follows:

  • haversine_distance: 1334 km
  • mse: 149 km²

A similar model trained on tweet text alone yielded higher loss and mse values, which suggests that incorporating the NLP features improves overall performance.

Please explore the code and documentation within this repository for more details about the project and its implementation.

  • Note: The provided results are based on a specific experiment, and your results may vary depending on the dataset and configuration used.
