
IR 2017

A simple information retrieval system using Python 3 and Spark.

How to run the project

Please make sure you have installed the required components, such as MySQL, Redis, and Spark, along with their runtime environments.

First, run update/__init__.py, which calls the crawler and then builds the indexes and models, such as each word's posting list and the word co-occurrence model.
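For orientation, a posting list maps each term to the documents (and positions) where it occurs. Below is a minimal sketch of the idea in Python; it is illustrative only, not the project's actual structures:

    from collections import defaultdict

    # Toy inverted index: term -> list of (doc_id, position) pairs.
    # update/__init__.py builds the real index; this is just the concept.
    def build_postings(docs):
        postings = defaultdict(list)
        for doc_id, text in docs.items():
            for pos, term in enumerate(text.split()):
                postings[term].append((doc_id, pos))
        return postings

    docs = {1: "information retrieval with spark",
            2: "spark builds the index"}
    print(build_postings(docs)["spark"])  # [(1, 3), (2, 0)]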

Second, run main.py, which starts the back-end server and provides the search services.

Finally, run ir201712-front_end/search_engine.py to start the front-end server, then visit 127.0.0.1:5000 in your browser.
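Once all three processes are running, you can also smoke-test the front end from Python; the address comes from the step above:

    import urllib.request

    # Fetch the front-end index page; expect HTTP 200 when it is up.
    with urllib.request.urlopen("http://127.0.0.1:5000/") as resp:
        print(resp.status)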

How to install

git

  1. install git
  2. set the initial git parameters:
    • git config --global user.name <github_name>
    • git config --global user.email <github_email>
  3. git clone https://github.com/xuesu/ir201712.git

mysql

  1. install mysql
    • In Ubuntu: sudo apt install mysql-server
  2. open a mysql terminal (note: always use the charset 'utf8mb4'); the connectivity check after this list assumes these exact names:
    • Create a new database so the program never needs database-level changes: CREATE DATABASE ir character set utf8mb4 collate utf8mb4_bin;
    • Then create a test database: CREATE DATABASE ir_test character set utf8mb4 collate utf8mb4_bin;
    • Create a new user: CREATE USER 'IRDBA'@'localhost' IDENTIFIED BY 'complexpwd';
    • Grant privileges to the user: GRANT ALL ON ir.* TO 'IRDBA'@'localhost'; GRANT ALL ON ir_test.* TO 'IRDBA'@'localhost';
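To verify the setup, here is a short connectivity check from Python, assuming the PyMySQL driver (the project may use a different client):

    import pymysql

    # Connect with the user and database created above.
    conn = pymysql.connect(host="localhost", user="IRDBA",
                           password="complexpwd", database="ir",
                           charset="utf8mb4")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT @@character_set_database, @@collation_database")
            print(cur.fetchone())  # expect ('utf8mb4', 'utf8mb4_bin')
    finally:
        conn.close()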

Redis

  1. install Redis
    • In Ubuntu: sudo apt-get install redis-server
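A quick check that the server is reachable, assuming the redis-py client:

    import redis

    # Ping the local Redis server on its default port.
    r = redis.Redis(host="localhost", port=6379, db=0)
    print(r.ping())  # expect True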

virtualenv

  1. install Anaconda
  2. build a new virtualenv: conda create -n <env_name> python=3
  3. activate the virtualenv: source activate <env_name>
  4. install the dependencies: pip install -r requirements.list
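A quick sanity check that the activated environment really runs Python 3:

    import sys

    # Run inside the activated virtualenv.
    print(sys.version)
    assert sys.version_info[0] == 3, "activate the python=3 env first"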

spark

  1. download & unzip https://www.apache.org/dyn/closer.lua/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
  2. set the environment variables:
    1. in Ubuntu:
    export SPARK_HOME="/XXXX/spark-2.2.1-bin-hadoop2.7"
    export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
    export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
    
  3. If you are using the pyspark shell, you can start now.
  4. If you are using PyCharm, you need to add <spark_home>/python/pyspark & <spark_home>/python/lib/py4j-0.9-src.zip to the content roots.
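To confirm that SPARK_HOME and PYTHONPATH are set correctly, you can run a trivial local job; a minimal sketch (SparkSession has been available since Spark 2.0):

    from pyspark.sql import SparkSession

    # Tiny local job; success means pyspark is importable and working.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("ir201712-smoke-test")
             .getOrCreate())
    print(spark.sparkContext.parallelize(range(100)).sum())  # 4950
    spark.stop()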

emotions

It takes about 810 MB of memory, alas.

  1. cd emotions
  2. build a new virtualenv: conda create -n <env_name2> python=2
    • NOTE: this sub-project runs on Python 2, not Python 3!
  3. activate the virtualenv: source activate <env_name2>
  4. install the dependencies: pip install -r requirements.list
  5. python demo_service.py

How to develop

  1. Follow the basic coding conventions when you can, but it is OK to write in your own style.
  2. Try to write some test cases.
  3. Always pull from master and push to dev:
    • git pull origin master
    • git add *
    • git commit -m "<my_change>"
    • git push origin <branch>:dev

How to read the code?

Please start by reading update/__init__.py.
